Thursday, March 27, 2014

Parsing Drug Dosages in text using Finite State Machines


Someone recently pointed out an issue with the Drug Dosage FSM in Apache cTakes on the cTakes mailing list. Looking at the code for it revealed a fairly complex implementation based on a hierarchy of Finite State Machines (FSM). The intuition behind the implementation is that Drug Dosage text in doctor's notes tend to follow a standard-ish format, and FSMs can be used to exploit this structure and pull out relevant entities out of this text. The paper Extracting Structured Medication Event Information from Discharge Summaries has more information about this problem. The authors provide their own solution, called the Merki Medication Parser. Here is a link to their Online Demo and source code (Perl).

I've never used FSMs myself, although I have seen it used to model (more structured) systems. So the idea of using FSMs for parsing semi-structured text such as this seemed interesting and I decided to try it out myself. The implementation I describe here is nowhere nearly as complex as the one in cTakes, but on the flip side, is neither as accurate, nor broad nor bulletproof either.

My solution uses drug dosage phrase data provided in this Pattern Matching article by Erin Rhode (which also comes with a Perl based solution), as well as its dictionaries (with additions by me), to model the phrases with the state diagram below. I built the diagram by eyeballing the outputs from Erin Rhode's program. I then implement the state diagram with a home-grown FSM implementation based on ideas from Electric Monk's post on FSMs in Python and the documentation for the Java library Tungsten FSM. I initially tried to use Tungsten-FSM, but ended up with extremely verbose Scala code because of Scala's stricter generics system.


The FSM implementation is described below. It is built up by the client adding all the states (the nodes in the diagram above) and transitions (the edges connecting the nodes). Each transition specifies a Guard, essentially a predicate that dictates if the machine can transition to the target state, and an Action that defines the actions to be taken when the transition is attempted. For each input text to be parsed, the client constructs a new FSM, then passes in a list of tokens to the run() method, which applies the transition to each token in the list.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
// Source: src/main/scala/com/mycompany/scalcium/utils/FSM.scala
package com.mycompany.scalcium.utils

import scala.collection.mutable.ArrayBuffer

trait Guard[T] {
  def accept(token: T): Boolean
}

trait Action[T] {
  def perform(currState: String, token: T): Unit
}

class FSM[T](val debug: Boolean = false) {

  val states = ArrayBuffer[String]()
  val transitions = scala.collection.mutable.Map[
    String,ArrayBuffer[(String,Guard[T],Action[T])]]()
  var currState: String = "START"
  
  def addState(state: String): Unit = {
    states += state
  }
  
  def addTransition(from: String, to: String,
      guard: Guard[T], action: Action[T]): Unit = {
    val translist = transitions.getOrElse(from, ArrayBuffer()) 
    translist += ((to, guard, action))
    transitions.put(from, translist)
  } 
  
  def transition(token: T): Unit = {
    val targetStates = transitions.getOrElse(currState, List())
      .filter(tga => tga._2.accept(token))
    if (targetStates.size == 1) {
      val targetState = targetStates.head
      if (debug)
        Console.println("%s -> %s".format(currState, targetState._1))
      currState = targetState._1
      targetState._3.perform(currState, token)
    } else {
      transitions.get(currState) match {
        case Some(x) => x.head._3.perform(currState, token)
        case None => {}
      }
    }
  }
  
  def run(tokens: List[T]): Unit = tokens.foreach(transition(_))
}

The client instantiates the FSM, then populates its states and transitions by calls to addState() and addTransition() respectively. The client also defines custom Guard objects. In general, a Guard for a transition from A to B checks to see if the current token is acceptable for B. If so, it makes the transition, else it remains in state A. Since there are multiple states transitioning to the END state, we want to make it the least attractive target state, hence the noGuard - a state transition from some state to the END state will only happen if the token stream is exhausted and there are no better alternatives.

The other Guards are the DictGuard, the RegexGuard and the CombinedGuard. The DictGuard is a Gazetteer, accepting a token that matches an entry in its input file. The RegexGuard matches a pattern that is specified in its input file, and a CombinedGuard is a combination of a DictGuard and RegexGuard.

The CollectAction captures (state, token) pairs as the FSM processes the tokens. The pairs are returned as a Map[state,List[token]] using the getSymbolTable() method of CollectAction. The action is associated with the FSM rather than a state, but I retained the Action per State idea just in case I ever need it.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
// Source: src/main/scala/com/mycompany/scalcium/drugdosage/DrugDosageFSM.scala
package com.mycompany.scalcium.drugdosage

import java.io.File
import java.util.regex.Pattern

import scala.collection.TraversableOnce.flattenTraversableOnce
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.mycompany.scalcium.utils.Action
import com.mycompany.scalcium.utils.FSM
import com.mycompany.scalcium.utils.Guard

class BoolGuard(val refValue: Boolean) extends Guard[String] {
  override def accept(token: String): Boolean = refValue
}

class DictGuard(val file: File) extends Guard[String] {
  val words = Source.fromFile(file).getLines()
    .map(line => line.toLowerCase().split(" ").toList)
    .flatten
    .toSet
  
  override def accept(token: String): Boolean = {
    words.contains(token.toLowerCase())
  }
}

class RegexGuard(val file: File) extends Guard[String] {
  val patterns = Source.fromFile(file).getLines()
    .map(line => Pattern.compile(line))
    .toList
  
  override def accept(token: String): Boolean = {
    patterns.map(pattern => {
      val matcher = pattern.matcher(token)
      if (matcher.matches()) return true
    })
    false
  }
}

class CombinedGuard(val file: File, val pfile: File) 
      extends Guard[String] {
  val words = Source.fromFile(file).getLines()
    .map(line => line.toLowerCase().split(" ").toList)
    .flatten
    .toSet
  val patterns = Source.fromFile(pfile).getLines()
    .map(line => Pattern.compile(line))
    .toList
    
  override def accept(token: String): Boolean = {
    acceptWord(token) || acceptPattern(token)
  }
  
  def acceptWord(token: String): Boolean = 
    words.contains(token.toLowerCase())
    
  def acceptPattern(token: String): Boolean = {
    patterns.map(pattern => {
      val matcher = pattern.matcher(token)
      if (matcher.matches()) return true
    })
    false
  }
}

class CollectAction(val debug: Boolean) extends Action[String] {
  val stab = new ArrayBuffer[(String,String)]()
  
  override def perform(currState: String, token: String): Unit = {
    if (debug)
      Console.println("setting: %s to %s".format(token, currState))
    stab += ((currState, token))
  }
  
  def getSymbolTable(): Map[String,List[String]] = {
    stab.groupBy(kv => kv._1)
      .map(kv => (kv._1, kv._2.map(_._2).toList))
  }
}

class DrugDosageFSM(val drugFile: File, 
    val freqFile: File, 
    val routeFile: File,
    val unitsFile: File,
    val numPatternsFile: File,
    val debug: Boolean = false) {

  
  def parse(s: String): Map[String,List[String]] = {
    val collector = new CollectAction(debug)
    val fsm = buildFSM(collector, debug)
    val x = fsm.run(s.toLowerCase()
        .replaceAll("[,;]", " ")
        .replaceAll("\\s+", " ")
        .split(" ")
        .toList)
    collector.getSymbolTable()
  }
  
  def buildFSM(collector: CollectAction, debug: Boolean): FSM[String] = {
    val fsm = new FSM[String](debug)
    // states
    fsm.addState("START")
    fsm.addState("DRUG")
    fsm.addState("DOSAGE")
    fsm.addState("ROUTE")
    fsm.addState("FREQ")
    fsm.addState("QTY")
    fsm.addState("REFILL")
    fsm.addState("END")
  
    val noGuard = new BoolGuard(false)
    val drugGuard = new DictGuard(drugFile)
    val dosageGuard = new CombinedGuard(unitsFile, numPatternsFile)
    val freqGuard = new DictGuard(freqFile)
    val qtyGuard = new RegexGuard(numPatternsFile)
    val routeGuard = new DictGuard(routeFile)
    val refillGuard = new CombinedGuard(unitsFile, numPatternsFile)
    
    // transitions
    fsm.addTransition("START", "DRUG", drugGuard, collector)
    fsm.addTransition("DRUG", "DOSAGE", dosageGuard, collector)
    fsm.addTransition("DRUG", "FREQ", freqGuard, collector)
    fsm.addTransition("DRUG", "ROUTE", routeGuard, collector)
    fsm.addTransition("DOSAGE", "ROUTE", routeGuard, collector)
    fsm.addTransition("DOSAGE", "FREQ", freqGuard, collector)
    fsm.addTransition("ROUTE", "FREQ", freqGuard, collector)
    fsm.addTransition("FREQ", "QTY", qtyGuard, collector)
    fsm.addTransition("FREQ", "END", noGuard, collector)
    fsm.addTransition("QTY", "REFILL", refillGuard, collector)
    fsm.addTransition("QTY", "END", noGuard, collector)
    fsm.addTransition("REFILL", "END", noGuard, collector)
    
    fsm
  }
}  

We call this client using the JUnit test code below:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
// Source: src/test/scala/com/mycompany/scalcium/drugdosage/DrugDosageFSMTest.scala
package com.mycompany.scalcium.drugdosage

import org.junit.Test
import java.io.File
import scala.io.Source
import java.io.PrintWriter
import java.io.FileWriter

class DrugDosageFSMTest {

  val datadir = "/path/to/data/directory"
    
  @Test
  def testParse(): Unit = {
    val ddFSM = new DrugDosageFSM(
        new File(datadir, "drugs.dict"),
        new File(datadir, "frequencies.dict"),
        new File(datadir, "routes.dict"),
        new File(datadir, "units.dict"),
        new File(datadir, "num_patterns.dict"),
        false)
    val writer = new PrintWriter(new FileWriter(
      new File(datadir, "fsm_output.txt")))
    Source.fromFile(new File(datadir, "input.txt"))
      .getLines()
      .foreach(line => {
         val stab = ddFSM.parse(line)
         writer.println(line)
         stab.toList
           .sortBy(kv => kv._1)
           .foreach(kv => 
             writer.println(kv._1 + ": " + kv._2.mkString(" ")))
         writer.println()
    })
    writer.flush()
    writer.close()
  }
}

As mentioned already, the input data and the entity dictionaries come from Erin Rhode's Pattern Matching article. I have made a few additions to these files, and added a new file for specifying numeric patterns called num_patterns.dict, so my version of the data files can be found on Github here.

Although the output is not perfect, it doesn't look too bad. Here are the first 5 results. The full output can be viewed in the fsm_output.txt file on Github in the link above.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
hydrocortizone cream, apply to rash bid
DRUG: hydrocortizone cream
FREQ: bid
ROUTE: apply to rash

albuterol inhaler one to two puffs bid
DOSAGE: one to two puffs
DRUG: albuterol inhaler
FREQ: bid

Enteric coated aspirin 81 mg tablets one qd
DOSAGE: 81 mg tablets one
DRUG: enteric coated aspirin
FREQ: qd

Vitamin B12 1000 mcg IM
DOSAGE: 1000 mcg
DRUG: vitamin b12
ROUTE: im

atenolol 50 mg tabs one qd, #100, one year
DOSAGE: 50 mg tabs one
DRUG: atenolol
FREQ: qd #100
QTY: one year
...

There is obviously lot of scope for improvement here. For one thing, I suspect that my state diagram is much too simple to model the phrases. But its a start, and I plan on exploring some other solutions as well. More when I do.

Be the first to comment. Comments are moderated to prevent spam.