Monday, July 27, 2015

Discovering Entity Relationships in Sherlock Holmes


Sir Arthur Conan Doyle's Sherlock Holmes stories were quite popular back when I was a teenager. The stories describe the exploits of Sherlock Holmes, as narrated by his friend and sidekick Dr. Watson. Not only do the stories share their central characters, they also have characters dropping in and out across different stories, much like a long-running TV series. Recently I found some of these stories on Project Gutenberg, so I thought it might be interesting to write some code to discover relationships among the various characters. This post describes what I did. I chose the following texts:

  • pg108.txt: The Return of Sherlock Holmes
  • pg1661.txt: The Adventures of Sherlock Holmes
  • pg2097.txt: The Sign of the Four
  • pg2343.txt: The Adventure of Wisteria Lodge
  • pg2344.txt: The Adventure of the Cardboard Box
  • pg2345.txt: The Adventure of the Red Circle
  • pg2346.txt: The Adventure of the Bruce-Partington Plans
  • pg2347.txt: The Adventure of the Dying Detective
  • pg2348.txt: The Disappearance of Lady Frances Carfax
  • pg2349.txt: The Adventure of the Devil's Foot
  • pg2350.txt: His Last Bow
  • pg244.txt: A Study In Scarlet
  • pg2852.txt: The Hound of the Baskervilles
  • pg3289.txt: The Valley of Fear
  • pg834.txt: Memoirs of Sherlock Holmes

Each of these files is available as plain text. The first step is to remove the Project Gutenberg headers and footers, then convert the remaining text into paragraphs. From that point on, I used the Stanford CoreNLP library to split the paragraphs into sentences, then extract named entities from these sentences. Since I was interested in identifying the relationships between characters, I focused on the PERSON entities only. The next step is to disambiguate the entities, i.e., to determine that "Mr. Holmes", "Holmes" and "Sherlock Holmes" all refer to the same person. Finally, I compute a weighted co-occurrence score between pairs of entities across various text blocks and build a graph (for the top entities) for visualization.

The driver code is shown below. It calls out to various components, each of which is described in more detail in the sections that follow.

// Source: src/main/scala/com/mycompany/scalcium/sherlock/SherlockMain.scala
package com.mycompany.scalcium.sherlock

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.Array.canBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.apache.commons.io.FileUtils

object SherlockMain extends App {

    val dataDir = new File("data/sherlock/")
    
    val metadataRemover = new MetadataRemover()
    val paragraphSplitter = new ParagraphSplitter()
    val sentenceSplitter = new SentenceSplitter()
    val nerFinder = new NERFinder()
    val entityDisambiguator = new EntityDisambiguator()
    val graphDataGenerator = new GraphDataGenerator()
    
    preprocess(new File(dataDir, "gutenberg"), 
        new File(dataDir, "output.tsv"))
    wordCounts(new File(dataDir, "gutenberg"))
    disambiguatePersons(new File(dataDir, "output.tsv"), 
        new File(dataDir, "output-PER-disambig.tsv"))
    generateGraphData(new File(dataDir, "output-PER-disambig.tsv"),
        new File(dataDir, "PER-graph-verts.tsv"),
        new File(dataDir, "PER-graph-edges.tsv"))
        
    
    def preprocess(indir: File, outfile: File): Unit = {
        val writer = new PrintWriter(new FileWriter(outfile), true)
        val gutenbergFiles = indir.listFiles()
        gutenbergFiles.zipWithIndex.map(tf => {
                val text = FileUtils.readFileToString(tf._1, "UTF-8")
                (tf._2, text)
            })
            .map(ft => (ft._1, metadataRemover.removeMetadata(ft._2)))
            .flatMap(ft => paragraphSplitter.split(ft))
            .flatMap(fpt => sentenceSplitter.split(fpt, false))
            .flatMap(fpst => nerFinder.find(fpst))
            .foreach(fpstt => save(writer, fpstt.productIterator))
        writer.flush()
        writer.close()
    }
        
    def save(outfile: PrintWriter, line: Iterator[_]): Unit =
        outfile.println(line.toList.mkString("\t"))

    def wordCounts(indir: File): Unit = {
        val gutenbergFiles = indir.listFiles()
        val numWordsInFile = ArrayBuffer[Int]()
        val numWordsInPara = ArrayBuffer[Int]()
        val numWordsInSent = ArrayBuffer[Int]()
        gutenbergFiles.zipWithIndex.foreach(tf => {
            val text = metadataRemover.removeMetadata(
                FileUtils.readFileToString(tf._1, "UTF-8"))
            numWordsInFile += text.split(" ").size
            val paras = paragraphSplitter.split((tf._2, text))
            paras.foreach(para => {
                numWordsInPara += para._3.split(" ").size
                val sents = sentenceSplitter.split(para, false)
                sents.foreach(sent => 
                    numWordsInSent += sent._4.split(" ").size)
            })
        })
        Console.println("Average words/file: %.3f".format(
            mean(numWordsInFile.toArray)))                
        Console.println("Average words/para: %.3f".format(
            mean(numWordsInPara.toArray)))
        Console.println("Average words/sent: %.3f".format(
            mean(numWordsInSent.toArray)))
    }
    
    def mean(xs: Array[Int]): Double = 
        xs.foldLeft(0)(_ + _).toDouble / xs.size

    def disambiguatePersons(infile: File, outfile: File): Unit = {
        val personEntities = entityDisambiguator.filterEntities(
            infile, "PERSON")
        val uniquePersonEntities = personEntities.distinct
        val personSims = entityDisambiguator.similarities(uniquePersonEntities)
        val personSyns = entityDisambiguator.synonyms(personSims, 0.6) ++
            // add few well-known ones that escaped the 0.6 dragnet
            Map(("Holmes", "Sherlock Holmes"),
                ("MR. HOLMES", "Sherlock Holmes"),
                ("MR. SHERLOCK HOLMES", "Sherlock Holmes"),
                ("Mycroft", "Mycroft Holmes"),
                ("Brother Mycroft", "Mycroft Holmes"))
        val writer = new PrintWriter(new FileWriter(outfile), true)
        Source.fromFile(infile).getLines
              .map(line => line.split("\t"))
              .filter(cols => cols(4).equals("PERSON"))
              .map(cols => (cols(0), cols(1), cols(2), 
                  personSyns.getOrElse(cols(3), cols(3))))
              .foreach(cs => writer.println("%d\t%d\t%d\t%s".format(
                  cs._1.toInt, cs._2.toInt, cs._3.toInt, cs._4)))
        writer.flush()
        writer.close()
    }
    
    def generateGraphData(infile: File, vfile: File, efile: File): Unit = {
        val vwriter = new PrintWriter(new FileWriter(vfile), true)
        val vertices = graphDataGenerator.vertices(infile, 0)
        vertices.toList.sortWith((a, b) => a._2 > b._2)
            .foreach(vert => vwriter.println("%s\t%d".format(
                vert._1, vert._2)))
        vwriter.flush()
        vwriter.close()
        // consider minFreq = 32 (top 20), create exclude Set
        val excludeVertices = vertices.filter(vert => vert._2 < 32)
                                      .map(vert => vert._1)
                                      .toSet
        val ewriter = new PrintWriter(new FileWriter(efile), true)
        val edges = graphDataGenerator.edges(infile, 0.1D, excludeVertices)
        edges.foreach(edge => ewriter.println("%s\t%s\t%.3f".format(
            edge._1, edge._2, edge._3)))
        ewriter.flush()
        ewriter.close()
    }
}

Preprocessing


This phase consists of multiple operations, each with an associated class. I had initially set it up so that each step read the output of the previous step and wrote its own output file. Later I remodeled it as a pipeline so it can be readily adapted into a Spark pipeline.

The first step is to remove the Gutenberg header and footer blocks from the files, since we don't want to extract entities from that portion of the text. We also apply some heuristics to remove paragraphs which can be easily identified as being outside the story.

// Source: src/main/scala/com/mycompany/scalcium/sherlock/MetadataRemover.scala
package com.mycompany.scalcium.sherlock

class MetadataRemover {
    
    def removeMetadata(text: String): String = {
        val lines = text.split("\n")
        val bounds = lines.zipWithIndex
                          .filter(li => 
                              li._1.startsWith("*** START OF THIS PROJECT") ||
                              li._1.startsWith("*** END OF THIS PROJECT"))
                          .map(li => li._2)
                          .toList
        lines.slice(bounds(0) + 1, bounds(1))
             .filter(line => looksLikeStoryLine(line))
             .mkString("\n")
    }
    
    def looksLikeStoryLine(line: String): Boolean = {
        (!line.startsWith("Produced by") && 
            !line.startsWith("By") &&
            !line.startsWith("by") &&
            !line.startsWith("End of the Project") &&
            !line.startsWith("End of Project") &&
            !line.contains("Arthur Conan Doyle") &&
            !(line.trim.size > 0 && line.toUpperCase().equals(line)))
    }

}

The input at this stage consists of lines hard-wrapped at around the 80th character, with a single sentence potentially split across multiple lines. Paragraphs are separated by multiple newlines. Our code converts this into paragraphs of text, where each paragraph is a single line containing multiple sentences. We also eliminate paragraphs shorter than 40 characters, as these tend to be titles or section headings.
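
The same blank-line grouping can be sketched much more compactly in Python with a regular expression. This is a minimal sketch for illustration, not the code used here; the Scala version below builds the line spans explicitly:

```python
import re

def split_paragraphs(text, min_len=40):
    # Paragraphs are separated by one or more blank lines; hard-wrapped
    # lines inside a paragraph are rejoined with single spaces.
    paras = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    # Drop short paragraphs, which tend to be titles or section headings.
    return [p for p in paras if len(p) > min_len]

text = ("A TITLE\n\n"
        "It was in the spring of the year that\n"
        "Holmes received a telegram.\n\n"
        "THE END")
# The two wrapped lines are rejoined into a single paragraph, and the
# short title lines are dropped.
print(split_paragraphs(text))
```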

// Source: src/main/scala/com/mycompany/scalcium/sherlock/ParagraphSplitter.scala
package com.mycompany.scalcium.sherlock

import scala.Array.canBuildFrom
import scala.collection.mutable.Stack

class ParagraphSplitter {
    
    def split(fileText: (Int, String)): List[(Int, Int, String)] = {
        val filename = fileText._1
        val lines = fileText._2.split("\n")
        val nonEmptyLineNbrPairs = lines
            .zipWithIndex
            .filter(li => li._1.trim.size > 0)
            .map(li => li._2)
            .sliding(2)
            .map(poss => (poss(0), poss(1)))
            .toList
        val spans = Stack[(Int,Int)]()
        var inSpan = false
        nonEmptyLineNbrPairs.foreach(pair => {
            if (spans.isEmpty) {
                if (pair._1 + 1 == pair._2) {
                    spans.push(pair)
                    inSpan = true
                } else {
                    spans.push((pair._1, pair._1))
                    spans.push((pair._2, pair._2))
                }
            } else {
                val last = spans.pop
                if (pair._1 + 1 == pair._2) {
                    spans.push((last._1, pair._2))
                    inSpan = true
                } else {
                    if (inSpan) {
                        spans.push((last._1, pair._1 + 1))
                        spans.push((pair._2, pair._2))
                        inSpan = false
                    } else {
                        spans.push(last)
                        spans.push((pair._2, pair._2))
                    }
                }
            }
        })
        val lastSpan = spans.pop
        spans.push((lastSpan._1, lastSpan._2 + 1))
        
        // extract paragraphs in order and add extra sequence info
        spans.reverse.toList
             .map(span => {
                  if (span._1 == span._2) lines(span._1)
                  else lines.slice(span._1, span._2).mkString(" ")
             })
             .filter(line => !line.startsWith(" ") &&
                             line.length > 40) // remove TOCs and titles
            .zipWithIndex
            .map(paraIdx => (filename, paraIdx._2, paraIdx._1))
    }
}

Once we have paragraphs, we use the Stanford CoreNLP library to split them into sentences. Along with the sentences themselves, we also record the file number, paragraph number and sentence number. We will need these coordinates later when computing the co-occurrence scores between entities.

// Source: src/main/scala/com/mycompany/scalcium/sherlock/SentenceSplitter.scala
package com.mycompany.scalcium.sherlock

import java.util.Properties

import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.JavaConversions.propertiesAsScalaMap

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class SentenceSplitter {
    
    val props = new Properties()
    props("annotators") = "tokenize, ssplit"
    val pipeline = new StanfordCoreNLP(props)
    
    def split(fileParaText: (Int, Int, String), doPadding: Boolean): 
            List[(Int, Int, Int, String)] = {
        val fileId = fileParaText._1
        val paraId = fileParaText._2
        val paraText = if (!doPadding) fileParaText._3
                       else "<START>. " + fileParaText._3
        val annot = new Annotation(paraText)
        pipeline.annotate(annot)
        annot.get(classOf[SentencesAnnotation])
             .map(sentence => sentence.toString())
             .zipWithIndex
             .map(sentWithId => 
                 (fileId, paraId, sentWithId._2, sentWithId._1))
             .toList
    }
}

Finally, we use Stanford CoreNLP again to extract named entities from the sentences. It can extract around seven classes of entities, but we are interested only in the PERSON entity class.

// Source: src/main/scala/com/mycompany/scalcium/sherlock/NERFinder.scala
package com.mycompany.scalcium.sherlock

import java.util.Properties

import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.JavaConversions.propertiesAsScalaMap
import scala.collection.mutable.Stack

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class NERFinder {

    val props = new Properties()
    props("annotators") = "tokenize, ssplit, pos, lemma, ner"
    props("ssplit.isOneSentence") = "true"
    val pipeline = new StanfordCoreNLP(props)

    def find(sent: (Int, Int, Int, String)): 
            List[(Int, Int, Int, String, String)] = {
        val fileId = sent._1
        val paraId = sent._2
        val sentId = sent._3
        val sentText = sent._4
        val annot = new Annotation(sentText)
        pipeline.annotate(annot)
        val tokTags = annot.get(classOf[SentencesAnnotation])
                           .head // only one sentence in input
                           .get(classOf[TokensAnnotation])
                           .map(token => {
                                val begin = token.beginPosition()
                                val end = token.endPosition()
                                val nerToken = sentText.substring(begin, end)
                                val nerTag = token.ner()
                                (nerToken, nerTag)
                           })
        // consolidate NER for multiple tokens into one, so
        // for example: Ronald/PERSON Adair/PERSON becomes 
        // "Ronald Adair"/PERSON
        val spans = Stack[(String, String)]()
        tokTags.foreach(tokTag => {
            if (spans.isEmpty) spans.push(tokTag)
            else if (tokTag._2.equals("O")) spans.push(tokTag)
            else {
                val prevEntry = spans.pop
                if (prevEntry._2.equals(tokTag._2))
                    spans.push((Array(prevEntry._1, tokTag._1).mkString(" "), 
                        tokTag._2))
                else {
                    spans.push(prevEntry)
                    spans.push(tokTag)
                }
            }
        })
        spans.reverse
             .filter(tokTag => !tokTag._2.equals("O"))
             .map(tokTag => (fileId, paraId, sentId, tokTag._1, tokTag._2))
             .toList
    }
}

The output of this pipeline is a single tab-separated file containing 17,437 entities, which looks like this. The first three columns are the file number, the paragraph number within the file, and the sentence number within the paragraph, respectively. The next column is the span of text representing the entity, and the last column is the entity type that CoreNLP assigned to this text.

0    0    0    the spring of the year 1894    DATE
0    0    0    London    LOCATION
0    0    0    Ronald Adair    PERSON
0    0    2    now    DATE
0    0    2    the end of nearly ten years    DURATION
0    0    4    now    DATE
0    0    4    once    DATE
0    0    5    first    ORDINAL
0    0    5    third    ORDINAL
0    0    5    last month    DATE
...

Entity Disambiguation


Since my objective is to figure out the relationships between characters in the stories, I focus only on the PERSON entities from here on. However, even among the PERSON entities there is potential for ambiguity: for example, "Sherlock Holmes" can be variously referred to as "Sherlock", "Holmes", "Mr. Holmes", etc. So I needed a way to "normalize" these mentions to a single entity name.

My disambiguation code first computes the Levenshtein distance between a pair of entity names. If the distance is 0, the names are identical and nothing more needs to be done. Otherwise, I calculate a measure of word overlap between the two strings (the size of the word intersection divided by the average number of words in the two strings) and consider the two entities equal if the score exceeds a certain threshold (in my case 0.6, chosen by quickly eyeballing the generated pairwise scores).
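
To make the overlap measure concrete, here is a minimal Python sketch of the same score (the actual implementation is the similarity method in EntityDisambiguator below):

```python
def overlap_score(e1, e2):
    # Size of the word intersection divided by the average number of
    # words in the two entity strings.
    w1, w2 = set(e1.split()), set(e2.split())
    return len(w1 & w2) / ((len(w1) + len(w2)) * 0.5)

# "Mr. Holmes" and "Sherlock Holmes" share one word ("Holmes") out of
# an average of two, scoring 0.5, just under the 0.6 threshold, which
# is why such pairs needed manual overrides.
print(overlap_score("Mr. Holmes", "Sherlock Holmes"))  # 0.5
```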

This list of entity pairs is converted into a dictionary of synonyms where the shorter entity name points to the longer one. I then pass the file generated above through this filter: any text span found in the dictionary is transformed into the longer entity name, otherwise it is kept as-is. The end product is a file of PERSON entities with the entity names mostly disambiguated. I did notice a few major cases that were not caught by the code above, so I added a set of manual overrides as well. The original list of 6,428 PERSON entities came down to 701 entities after disambiguation. Here is the code for the disambiguator.

// Source: src/main/scala/com/mycompany/scalcium/sherlock/EntityDisambiguator.scala
package com.mycompany.scalcium.sherlock

import java.io.File
import scala.io.Source
import org.apache.commons.lang3.StringUtils

class EntityDisambiguator {

    def filterEntities(infile: File, entityType: String): List[String] = {
        Source.fromFile(infile).getLines
              .filter(line => line.split("\t")(4).equals(entityType))
              .map(line => line.split("\t")(3))
              .toList
    }
    
    def similarities(uniqueEntities: List[String]): 
            List[(String, String, Double)] = {
        val entityPairs = for (e1 <- uniqueEntities; 
                               e2 <- uniqueEntities)
                          yield (e1, e2)
        // the filter removes exact duplicates, eg (Holmes, Holmes)
        // as well as makes the LHS shorter than RHS, eg (Holmes,
        // Sherlock Holmes)
        entityPairs.filter(ee => ee._1.length < ee._2.length)
          .map(ee => (ee._1, ee._2, similarity(ee._1, ee._2)))
    }
    
    def similarity(entityPair: (String, String)): Double = {
        val levenshtein = StringUtils.getLevenshteinDistance(
            entityPair._1, entityPair._2)
        if (levenshtein == 0.0D) 1.0D
        else {
            val words1 = entityPair._1.split(" ").toSet
            val words2 = entityPair._2.split(" ").toSet
            words1.intersect(words2).size.toDouble / 
                ((words1.size + words2.size) * 0.5)
        }
    }
    
    def synonyms(sims: List[(String, String, Double)], 
            minSim: Double): Map[String, String] = {
        sims.filter(ees => (!ees._1.equals(ees._2) && ees._3 > minSim))
            // remove duplicate mappings, eg Peter => Peter Carey and
            // Peter => Peter Jones. Like the unique cases which have 
            // no suspected candidates, these will resolve to themselves
            .groupBy(ee => ee._1)
            .filter(eeg => eeg._2.size == 1)
            .map(eeg => (eeg._1, eeg._2.head._2))
            .toMap
    }
}
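
Applying the resulting synonym map is just a dictionary lookup with fall-through. Here is a tiny Python sketch using a few of the manual overrides from the driver code:

```python
# Shorter surface forms map to the longer canonical name; this subset
# mirrors the manual overrides added in SherlockMain.
synonyms = {
    "Holmes": "Sherlock Holmes",
    "Mycroft": "Mycroft Holmes",
    "Brother Mycroft": "Mycroft Holmes",
}

def normalize(mention):
    # Mentions absent from the map resolve to themselves, like
    # personSyns.getOrElse in the Scala code.
    return synonyms.get(mention, mention)

print(normalize("Brother Mycroft"))  # Mycroft Holmes
print(normalize("Watson"))           # Watson (no mapping, kept as-is)
```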

Scoring


I then take pairs of entity instances (along with their "coordinates" given by the first three columns of the file: the file ID, paragraph ID and sentence ID) and compute a pairwise score for each pair. The final representation of these entities is a graph, with edges representing connections between entities and edge weights representing the degree of connectedness. Each line in the input file represents an instance of one of the entities.

The pairwise score is computed from the three coordinates. If two entity instances co-occur in the same sentence, they contribute the most to the edge weight between the two entities; less if they co-occur only in the same paragraph, and even less if they occur only in the same file. The actual weights are the reciprocals of the average sentence length, average paragraph length and average file length in words. The intuition is that a pair of entity instances is more likely to co-occur in the same paragraph than in the same sentence, and that likelihood scales with the number of words being considered.
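
The weighting scheme can be stated in a few lines; this Python sketch uses the same reciprocal constants that appear in GraphDataGenerator below:

```python
P_FILE = 1.0 / 30579    # reciprocal of average words per file
P_PARA = 1.0 / 56.534   # reciprocal of average words per paragraph
P_SENT = 1.0 / 15.629   # reciprocal of average words per sentence

def edge_weight(loc1, loc2):
    # loc = (file_id, para_id, sent_id); the tightest context the two
    # mentions share determines this co-occurrence's contribution.
    if loc1[0] != loc2[0]:
        return 0.0       # different files: no contribution
    if loc1[1] != loc2[1]:
        return P_FILE    # same file, different paragraphs
    if loc1[2] != loc2[2]:
        return P_PARA    # same paragraph, different sentences
    return P_SENT        # same sentence: strongest signal

# A single same-sentence co-occurrence (about 0.064) is worth roughly
# two thousand same-file co-occurrences (about 0.0000327 each).
```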

Just like term frequencies, entity frequencies seem to behave in a Zipfian manner, so it makes sense to discard low-frequency entities since they don't model anything interesting. Removing them at the outset also saves work when computing the pairwise weights. In our case we compute frequencies for all entities but only consider those whose frequency is higher than 32, leaving us with roughly the top 20 entities in our entity relationship graph. Here is the code to generate the entity frequency and the entity relation data.

// Source: src/main/scala/com/mycompany/scalcium/sherlock/GraphDataGenerator.scala
package com.mycompany.scalcium.sherlock

import java.io.File
import scala.io.Source
import java.io.PrintWriter
import java.io.FileWriter

class GraphDataGenerator {

    val pFile = 1.0D / 30579   // reciprocal of average file length in words
    val pPara = 1.0D / 56.534  // reciprocal of average paragraph length in words
    val pSent = 1.0D / 15.629  // reciprocal of average sentence length in words
    
    def vertices(infile: File, minFreq: Int = 0): Map[String, Int] = {
        Source.fromFile(infile).getLines
              .map(line => line.split("\t")(3))
              .toList
              .groupBy(name => name)
              .map(ng => (ng._1, ng._2.size))
              .filter(ns => ns._2 > minFreq)
    }
    
    def edges(infile: File, minSim: Double = 0.01D, 
            exclude: Set[String] = Set()): 
            List[(String,String,Double)] = {
        val personData = Source.fromFile(infile).getLines
                               .map(line => line.split("\t"))
                               .filter(cols => !exclude.contains(cols(3)))
                               .toList
        val personPairs = for (p1 <- personData; p2 <- personData) 
                          yield (p1, p2)
        personPairs.filter(pp => pp._1(3).length < pp._2(3).length)
            .map(pp => (pp._1(3), pp._2(3), edgeWeight(pp._1, pp._2)))
            .toList
            .groupBy(ppw => (ppw._1, ppw._2))
            .map(ppwg => (ppwg._1._1, ppwg._1._2, ppwg._2.map(_._3).sum))
            .filter(ppwg => ppwg._3 > minSim)
            .toList
    }
    
    def edgeWeight(p1: Array[_], p2: Array[_]): Double = {
        val p1Locs = p1.slice(0, 3).map(_.asInstanceOf[String].toInt)
        val p2Locs = p2.slice(0, 3).map(_.asInstanceOf[String].toInt)
        if (p1Locs(0) == p2Locs(0)) { // same file
            if (p1Locs(1) == p2Locs(1)) { // same para
                if (p1Locs(2) == p2Locs(2)) pSent // same sentence
                else pPara // same para but not same sentence 
            } else pFile // same file but not same para
        } else 0.0D // not same file
    }
}

This outputs two files, one with entities and their frequency in the corpus, and another with pairs of entities and their edge weights. They are shown below:

# PER-graph-verts.tsv
Sherlock Holmes    1818
Watson             609
Messrs. Lestrade   200
Henry              129
Miss Stapleton     91
Mortimer           86
Tobias Gregson     66
Charles            66
Barrymore          64
Hopkins            41
...

# PER-graph-edges.tsv
Mary         Sherlock Holmes    0.442
Jones        Sherlock Holmes    0.583
Barrymore    Sherlock Holmes    0.603
Hopkins      Sherlock Holmes    1.688
Drebber      Jefferson Hope     0.216
Hopkins      Messrs. Lestrade   0.111
Watson       Charles            0.257
Henry        Watson             0.633
Watson       Sherlock Holmes    10.432
James        Sherlock Holmes    0.643
...

Visualizations


We use the files generated in the previous step to create a frequency chart for the top 20 entities and a graph showing the relationships between them. The Python code to generate the frequency distribution is shown below:

# Source: src/main/python/sherlock/per_freqs.py
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt

fin = open("data/sherlock/PER-graph-verts.tsv", 'rb')
xs = []
ys = []
num_entries = 20
curr_entries = 0
for line in fin:
    # stop once we have the top num_entries rows
    if curr_entries >= num_entries:
        break
    cols = line.strip().split("\t")
    xs.append(cols[0])
    ys.append(int(cols[1]))
    curr_entries += 1
fin.close()
plt.barh(range(len(xs[::-1])), ys[::-1], align="center", height=1)
plt.yticks(range(len(xs)), xs[::-1])
plt.xlabel("Occurrence Count")
plt.grid()
plt.show()


Here is the code to generate the graph of relationships among the top 20 entities. Edges with higher weights are drawn thicker and are also color coded: red represents the highest weights, followed by blue, green and orange.

# Source: src/main/python/sherlock/per_rels.py
# -*- coding: utf-8 -*-
from __future__ import division
import networkx as nx
import matplotlib.pyplot as plt
import math

G = nx.Graph()
fin = open("data/sherlock/PER-graph-edges.tsv", 'rb')
for line in fin:
    cols = line.strip().split("\t")
    G.add_edge(cols[0], cols[1], weight=int(1.0+math.log(10.0*float(cols[2]))))

graph_pos = nx.shell_layout(G)

nx.draw_networkx_nodes(G, graph_pos, node_size=5000, alpha=0.3, 
                       node_color="blue")
nx.draw_networkx_labels(G, graph_pos, font_size=12, font_family="sans-serif")

# sort ascending so the heaviest edges pair with the last color (red)
edge_widths = sorted(set(map(lambda x: x[2]["weight"], G.edges(data=True))))
colors = ["magenta", "orange", "green", "blue", "red"]
for edge_width, color in zip(edge_widths, colors):
    edge_subset = [(u, v) for (u, v, d) in G.edges(data=True) 
                                        if d["weight"] == edge_width]
    nx.draw_networkx_edges(G, graph_pos, edgelist=edge_subset, width=edge_width,
                           alpha=0.3, edge_color=color)

fig = plt.gcf()
fig.set_size_inches(15, 15)
plt.xticks([])
plt.yticks([])
plt.show()                



Not surprisingly, Sherlock Holmes is the most connected, and he is connected most strongly to Watson (red) and Lestrade (blue). The graph is quite densely connected, which is also not that surprising, since we are considering the most frequent entities, i.e., the most popular characters in the series. Unfortunately, I was unable to recognize any of the characters beyond the top three. I also thought that some characters such as Mycroft Holmes (the 25th most frequent entity) and Professor Moriarty (the 58th) should have ranked higher. But that probably just means I wasn't enough of a Holmes fan to know the other characters :-).

Ideas for Improvement


There are some things I could have done that might have given better results (or at least results more in line with my idea of who the main characters should be). I deliberately skipped them because I wanted to see results quickly with comparatively little effort.

The first would be to manually split the stories. As it stands, each file is either a full novel (for example, The Hound of the Baskervilles) or a collection of short stories (for example, The Adventures of Sherlock Holmes), so it probably makes more sense to split the text by story rather than by file for weighting purposes. This also means revisiting the notion of average file length: story lengths for short stories and novels are likely to be quite different, so we might have to use local weights corresponding to the current story, or do away with the file-level measure altogether.

Another thing I plan to try in the future is pronoun resolution. Currently our entities come only from direct mentions of people's names. I did try CoreNLP's Coreference Resolution module to capture pronouns, but it generates all sorts of coreference candidates; it may be worth building something custom that works only on pronouns and associates them with PERSON mentions close enough in the preceding text.
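
A bare-bones version of that idea, resolving each pronoun to the nearest preceding PERSON mention, might look like the following sketch. The token format here is illustrative, not CoreNLP's API:

```python
PRONOUNS = {"he", "she", "him", "her", "his", "hers"}

def resolve_pronouns(tagged_tokens):
    # tagged_tokens: list of (word, ner_tag) pairs in document order.
    # Each pronoun is replaced by the most recent PERSON mention;
    # pronouns seen before any PERSON are left untouched.
    last_person = None
    resolved = []
    for word, tag in tagged_tokens:
        if tag == "PERSON":
            last_person = word
            resolved.append(word)
        elif word.lower() in PRONOUNS and last_person is not None:
            resolved.append(last_person)
        else:
            resolved.append(word)
    return resolved

tokens = [("Holmes", "PERSON"), ("said", "O"), ("that", "O"),
          ("he", "O"), ("knew", "O")]
print(resolve_pronouns(tokens))
# ['Holmes', 'said', 'that', 'Holmes', 'knew']
```

A real implementation would also have to handle gender agreement and dialogue, which is where CoreNLP's full coreference machinery earns its complexity.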

All the Scala code and Python code are available at these links on GitHub.