Wednesday, August 07, 2013

Embedding Concepts in text for smarter searching with Solr4


Storing the concept map for a document in a payload field works well for queries that can treat the document as a bag of concepts. However, if you want to take a concept's position(s) in the document into account, you are out of luck. For queries that resolve to multiple concepts, it makes sense to rank documents where these concepts appear close together higher than those where they are far apart, or even to drop the latter from the results altogether.

We handle this requirement by analyzing each document against our medical taxonomy, and annotating recognized words and phrases with the appropriate concept ID before it is sent to the index. At index time, a custom token filter similar to the SynonymTokenFilter (described in the LIA2 Book) places the concept ID at the start position of the recognized word or phrase. Resolved multi-word phrases are retained as single tokens - for example, the phrase "breast cancer" becomes "breast0cancer". This allows us to rewrite queries such as "breast cancer radiotherapy"~5 as "2790981 2791965"~5.
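To make the rewrite concrete, here is a minimal sketch of how a resolved query might be mapped to concept IDs. The conceptMap lookup below is hypothetical (the real mapping comes from our taxonomy NER system), and the IDs are the illustrative ones used in this post:

// Hypothetical sketch of the query-side concept rewrite; the conceptMap
// below is illustrative, the real lookup is backed by our medical taxonomy.
object QueryRewriteSketch extends App {

  val conceptMap = Map(
    "breast cancer" -> "2790981",
    "radiotherapy"  -> "2791965")

  // Replace each recognized phrase with its concept ID, longest phrase first
  // so that a longer match like "breast cancer" wins over a shorter one.
  def rewrite(query: String): String =
    conceptMap.toList.sortBy(p => -p._1.length).foldLeft(query) {
      (q, entry) => q.replace(entry._1, entry._2)
    }

  // prints: "2790981 2791965"~5
  Console.println("\"%s\"~5".format(rewrite("breast cancer radiotherapy")))
}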

One obvious advantage is that synonymy is implicitly supported by the rewrite. Medical literature is rich with synonyms and acronyms - for example, "breast cancer" may also appear as "breast neoplasm", "breast CA", etc. Once the query is rewritten, 2790981 matches the same document annotation regardless of which synonym appears in the text.

Another advantage is increased precision, since we are dealing with concepts rather than groups of words. For example, "radiotherapy for breast cancer patients" would not match our query, since "breast cancer patient" is a different concept from "breast cancer" and we annotate the longest matching subsequence.

Yet another advantage of this approach is that it supports mixed queries. Suppose a query can only be partially resolved to concepts - for example, "breast cancer" resolves to 2790981 but the remaining words do not. You can still issue the partially resolved query against the index, and it will pick up records where that pattern of concept IDs and words appears.

Finally, since this is just a (slightly rewritten) Solr query, all the features of standard Lucene/Solr proximity searches are available to you.

In this post, I describe the search side components that I built to support this approach. It involves a custom TokenFilter and a custom Analyzer that wraps it, along with a few lines of configuration code. The code is in Scala and targets Solr 4.3.0.

Custom TokenFilter and Analyzer


As mentioned earlier, my token filter is modeled on the SynonymTokenFilter described in the LIA2 book. I first built this against Solr 3.2.0 in Java, and it looked very similar to the example in the book. Since this one is written in Scala, I decided to structure it in a slightly more functional manner - specifically, incrementToken() has no loops; the code path is linear, with state held in the annotation stack. Here is the code for the TokenFilter and Analyzer.

// Source: src/main/scala/com/mycompany/solr4extras/cpos/ConceptPositionTokenFilter.scala
package com.mycompany.solr4extras.cpos

import java.io.Reader
import java.util.Stack
import java.util.concurrent.atomic.AtomicInteger
import java.util.regex.Pattern

import org.apache.lucene.analysis.{TokenFilter, TokenStream}
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents
import org.apache.lucene.analysis.core.WhitespaceTokenizer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, KeywordAttribute, OffsetAttribute, PositionIncrementAttribute}
import org.apache.lucene.util.Version

class ConceptPositionTokenFilter(input: TokenStream) 
    extends TokenFilter(input) {

  val TokenPattern = Pattern.compile("(\\d+\\$)+([\\S]+)")
  val AnnotationSeparator = '$'

  val termAttr = addAttribute(classOf[CharTermAttribute])
  val keyAttr = addAttribute(classOf[KeywordAttribute])
  val posAttr = addAttribute(classOf[PositionIncrementAttribute])
  val offsetAttr = addAttribute(classOf[OffsetAttribute])
  val annotations = new Stack[(String,Int,Int)]()
  val offset = new AtomicInteger()

  override def incrementToken(): Boolean = {
    if (annotations.isEmpty) {
      if (input.incrementToken()) {
        val term = new String(termAttr.buffer(), 0, termAttr.length())
        val matcher = TokenPattern.matcher(term)
        if (matcher.matches()) {
          val subtokens = term.split(AnnotationSeparator)
          val str = subtokens(subtokens.size - 1)
          clearAttributes()
          termAttr.copyBuffer(str.toCharArray(), 0, str.length())
          val startOffset = offset.get()
          val endOffset = offset.addAndGet(str.length() + 1)
          offsetAttr.setOffset(startOffset, endOffset)
          val range = 0 until subtokens.length - 1
          range.foreach(i => {
            annotations.push((subtokens(i), startOffset, endOffset))
          })
        } else {
          clearAttributes()
          termAttr.copyBuffer(term.toCharArray(), 0, term.length())
          val startOffset = offset.get()
          val endOffset = offset.addAndGet(term.length() + 1)
          offsetAttr.setOffset(startOffset, endOffset)
        }
        true
      } else 
        false
    } else {
      val (conceptId, startOffset, endOffset) = annotations.pop()
      clearAttributes()
      termAttr.copyBuffer(conceptId.toCharArray(), 0, conceptId.length())
      posAttr.setPositionIncrement(0)
      offsetAttr.setOffset(startOffset, endOffset)
      keyAttr.setKeyword(true)
      true
    }
  }
}

class ConceptPositionAnalyzer extends Analyzer {

  override def createComponents(fieldname: String, reader: Reader): 
      TokenStreamComponents = {
    val source = new WhitespaceTokenizer(Version.LUCENE_43, reader)
    val filter = new ConceptPositionTokenFilter(source)
    new TokenStreamComponents(source, filter)
  }
}

As you can see, the token filter expects its input to be in a specific format. Each token is either an unresolved word or a resolved word or phrase. A resolved word or phrase is collapsed into a single token by joining its words with "0" and prefixing the concept ID followed by a "$" sign - for example, a resolved token looks like 1234567$word0or0phrase. When the filter sees this pattern, it accumulates the concept IDs (there can be multiple concept IDs for ambiguous concepts) on a stack and returns the resolved word/phrase. While the stack is non-empty, it is popped and the concept ID tokens are returned at the same logical position as the preceding token, until the stack is empty. Unresolved words are passed through as-is.
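For example, an ambiguous token carries multiple concept ID prefixes - 9204769$9323323$vegf from the test data below has two. The following standalone snippet just illustrates the split logic that incrementToken() applies to such a token:

// Standalone illustration of the split logic inside incrementToken(),
// using one of the ambiguous tokens from the test data below.
val token = "9204769$9323323$vegf"      // two concept IDs for one phrase
val subtokens = token.split('$')        // Array("9204769", "9323323", "vegf")
val surface = subtokens.last            // "vegf" is emitted first
val conceptIds = subtokens.dropRight(1) // pushed on the stack, then emitted
                                        // at the same position (increment 0)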

You will notice (if you look at my unit test) that the tokens coming into my custom token filter are already stopworded and stemmed. This is a quirk of the taxonomy NER system, so we account for it by using different analyzer chains for indexing (the analyzer above: a Whitespace tokenizer followed by my custom TokenFilter) and for querying (see the fieldType configuration below).

JUnit test


To test the custom token filter, I wrote the following JUnit test. It takes four strings in the expected format, tokenizes them, and prints out the token attributes. To automate regression testing, it also checks two tokens in the stream for specific values.

// Source: src/test/scala/com/mycompany/solr4extras/cpos/ConceptPositionTokenFilterTest.scala
package com.mycompany.solr4extras.cpos

import java.io.StringReader

import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, KeywordAttribute, OffsetAttribute, PositionIncrementAttribute}
import org.junit.{Assert, Test}

class ConceptPositionTokenFilterTest {

  val TestStrings = Array[String](
    "5047841$radic0cystectomi with without 8095296$urethrectomi",
    "use mr 5047865$angiographi angiographi 9724059$ct0angiographi 8094588$stereotact0radiosurgeri intracranial avm",
    "factor influenc time 8239660$sentinel0node visual 2790981$breast0cancer patient 8100872$intradermal0injection radiotrac",
    "8129320$pretreat 9323323$vascular0endothelial0growth0factor 9204769$9323323$vegf 9160599$matrix0_metalloproteinase_9_ 9160599$mmp09 serum level patient with metastatic 8092190$non0small0cell0lung0canc nsclc"
  )
  
  @Test def testTokenFilter(): Unit = {
    val analyzer = new ConceptPositionAnalyzer()
    val expected = Map(
      ("breast0cancer" -> Token("breast0cancer", 172, 186, 1, false)),
      ("2790981" -> Token("2790981", 172, 186, 0, true)))
    TestStrings.foreach(testString => {
      Console.println("=== %s ===".format(testString))
      val tokenStream = analyzer.tokenStream("f", 
        new StringReader(testString))
      tokenStream.reset()
      while (tokenStream.incrementToken()) {
        val termAttr = tokenStream.getAttribute(classOf[CharTermAttribute])
        val offsetAttr = tokenStream.getAttribute(classOf[OffsetAttribute])
        val posAttr = tokenStream.getAttribute(
          classOf[PositionIncrementAttribute])
        val keyAttr = tokenStream.getAttribute(classOf[KeywordAttribute])
        val term = new String(termAttr.buffer(), 0, termAttr.length())
        if (expected.contains(term)) {
          val expectedToken = expected(term)
          Assert.assertEquals(expectedToken.strval, term)
          Assert.assertEquals(expectedToken.start, offsetAttr.startOffset())
          Assert.assertEquals(expectedToken.end, offsetAttr.endOffset())
          Assert.assertEquals(expectedToken.inc, posAttr.getPositionIncrement())
          Assert.assertEquals(expectedToken.isKeyword, keyAttr.isKeyword())
        }
        Console.println("  %s (@ %d, %d, %d) [keyword? %s]".format(
          term, offsetAttr.startOffset(), offsetAttr.endOffset(),
          posAttr.getPositionIncrement(), 
          if (keyAttr.isKeyword()) "TRUE" else "FALSE"))
      }
      tokenStream.end()
      tokenStream.close()
    })
  }

  case class Token(strval: String, start: Int, end: Int, 
    inc: Int, isKeyword: Boolean)
}

The (partial) output of this test shows that the custom filter is working as expected. Specifically, it places the concept ID tokens at the same logical position as the resolved entity and sets the offsets as expected. It also marks the concept IDs as keywords, in case we ever decide to stem the tokens using the indexing analyzer in Solr rather than pre-stemming them.

=== 5047841$radic0cystectomi with without 8095296$urethrectomi ===
  radic0cystectomi (@ 0, 17, 1) [keyword? FALSE]
  5047841 (@ 0, 17, 0) [keyword? TRUE]
  with (@ 17, 22, 1) [keyword? FALSE]
  without (@ 22, 30, 1) [keyword? FALSE]
  urethrectomi (@ 30, 43, 1) [keyword? FALSE]
  8095296 (@ 30, 43, 0) [keyword? TRUE]
=== ... ===

Configuration


Configuration is fairly straightforward. We declare two fields, "itemtitle" and "itemtitle_cp", that store the regular title and the annotated title of a set of articles, using the existing field type "text_en" and the new field type "text_cp" respectively. The text_cp field type configuration is shown below - its query-side analysis pipeline is identical to that of text_en. We also show the field definitions for the itemtitle and itemtitle_cp fields.

<!-- Source: solr/collection1/conf/schema.xml -->
    ....
    <!-- text_cp field type definition -->
    <fieldType name="text_cp" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index"
        class="com.mycompany.solr4extras.cpos.ConceptPositionAnalyzer"/>>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" 
                protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    ....
    <field name="itemtitle" type="text_en" indexed="true" stored="true"/>
    <field name="itemtitle_cp" type="text_cp" indexed="true" stored="true"/>
    ....

Loading data into Solr


Once configured, our Solr instance is ready to accept documents from our pipeline. Because the indexing process is fairly heavyweight, I decided to just pull a subset of documents from an existing Solr instance and use that to populate my test index. Here is the data loading code - it reads an XML file containing the Solr response for the first 1000 rows, with only the itemtitle and itemtitle_cp fields.

// Source: src/test/scala/com/mycompany/solr4extras/cpos/DataLoader.scala
package com.mycompany.solr4extras.cpos

import scala.io.Source
import scala.xml.NodeSeq.seqToNodeSeq
import scala.xml.XML

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer
import org.apache.solr.common.SolrInputDocument
import org.junit.{After, Before, Test}

class DataLoader {

  val solrUrl = "http://localhost:8983/solr/"
  val solr = new ConcurrentUpdateSolrServer(solrUrl, 10, 1)

  @Before def setup(): Unit = {
    solr.deleteByQuery("*:*")  
  }
  
  @After def teardown(): Unit = {
    solr.shutdown()
  }

  @Test def loadData(): Unit = {
    var i = 0
    val xml = XML.loadString(Source.fromFile(
      "data/cpos/cpdata.xml").mkString)
    val docs = xml \\ "response" \\ "result" \\ "doc"
    docs.foreach(doc => {
      val itemtitle = (doc \\ "str" \\ "_").
        filter(node => node.attribute("name").
        exists(name => name.text == "itemtitle")).
        map(node => node.text).
        mkString
      val itemtitle_cp = (doc \\ "str" \\ "_").
        filter(node => node.attribute("name").
        exists(name => name.text == "itemtitle_cp")).
        map(node => node.text).
        mkString
      Console.println(itemtitle)
      Console.println(itemtitle_cp)
      Console.println("--")
      val solrDoc = new SolrInputDocument()
      solrDoc.addField("id", "cpos-" + i)
      solrDoc.addField("itemtitle", itemtitle)
      solrDoc.addField("itemtitle_cp", itemtitle_cp)
      solr.add(solrDoc)
      i = i + 1
      if (i % 100 == 0) {
        Console.println("%d records processed".format(i))
        solr.commit()
      }
    })
    solr.commit()
  }
}

Running a sample itemtitle_cp value through Solr's analysis tool produces output similar to that of our JUnit test, indicating that we have successfully integrated our custom token filter into Solr.
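With the data loaded, a concept proximity query can be issued like any other Solr phrase query. Here is a minimal SolrJ sketch (not part of the project code) that runs the rewritten query from the beginning of the post against the itemtitle_cp field; the Solr URL and field names are assumed to match the configuration and DataLoader above.

// Minimal SolrJ sketch (not part of the original project code): run the
// rewritten concept proximity query against the itemtitle_cp field.
import scala.collection.JavaConverters._

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrServer

object ConceptProximityQuery extends App {
  val solr = new HttpSolrServer("http://localhost:8983/solr/")
  // "breast cancer" within 5 positions of "radiotherapy", as concept IDs
  val query = new SolrQuery("itemtitle_cp:\"2790981 2791965\"~5")
  query.setFields("id", "itemtitle")
  query.setRows(10)
  val rsp = solr.query(query)
  rsp.getResults().asScala.foreach(doc =>
    Console.println("%s: %s".format(
      doc.getFieldValue("id"), doc.getFieldValue("itemtitle"))))
  solr.shutdown()
}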


And this is all I have for today. Hope you found it interesting. All the code described here can be found on my GitHub project.

2 comments (moderated to prevent spam):

Dmitry Kan said...

Hi Sujit,

You seem to have had the same challenge we have regarding concept matching. I've attempted to approach the task using payloads and a custom similarity, but then I found out that the payload similarity works at the document level, even though the payloads are at the term level.

What I'm looking for now is whether a payload similarity class can be informed of the incoming query parameters, such that the query would contain the concept ID and, during similarity calculation, we would compare the query concept ID with the concept ID encoded in the term's payload. Have you ever thought about or done something like this?

Thanks,
Dmitry

Sujit Pal said...

Hi Dmitry, one of our concept matching approaches also used custom similarity and payloads like you mention, and we had no problems with payload similarity working at the document level - we would map the incoming query to concept IDs and then rewrite it as a combination of AND and OR queries. In this case (the scenario this post describes), payloads are not being used - the idea is that the concepts are embedded in the text as synonyms, so we can use normal (ie non-payload) queries. This gave us the added benefit of being able to do proximity searches on concepts.

However, back to your question - very likely I am misunderstanding, but I am not sure why you would need the similarity class to know about any other incoming query parameters. Your use case sounds exactly like ours, and the default functionality worked well for us. We had only one type of concept and a single field where we would store it - perhaps if you have different needs, consider different payload fields and/or namespaced concept IDs?