Friday, May 09, 2008

Tokenizing Text with ICU4j's RuleBasedBreakIterator

The field of Text Mining is relatively uncharted territory for me, and I wanted to learn about it, so I bought this book - "Text Mining Application Programming" by Manu Konchady. I liked the book a lot; read my review on Amazon for the superlatives. In any case, the first step to mining text is to tokenize the input so a program can do stuff with it, so I set about trying to build a tokenizer that would take a paragraph (from Konchady's book, embellished with a few more contrived sentences for testing) like the one shown below:

Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). 
Expect to pay around USD 120ks. Custom options can set you back another 
few 10,000 dollars. For details, go to <a href="http://www.jaguar.com/sales" 
alt="Click here">Jaguar Sales</a> or contact xj-6@jaguar.com.

...split it up into sentences, and then into word tokens. I ended up using the RuleBasedBreakIterator from the ICU4j project. Before that, however, I tried and discarded various other alternatives, which I briefly describe below.

I first considered splitting the sentence up by whitespace. However, that would not have captured the <a href=...> markup as a single token. I then considered splitting on a custom set of punctuation and whitespace characters, but this was even worse, since words such as 10,000, U.S. and XJ-6 were then treated as multiple tokens.
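
To make the failure modes concrete, here is a minimal sketch of what the two approaches produce (a contrived demo for this post, not code from my tokenizer):

public class NaiveSplitDemo {
  public static void main(String[] args) {
    String s = "Jaguar will sell its new XJ-6 model in the U.S. for 10,000 dollars.";
    // Whitespace split: trailing punctuation sticks to words ("dollars."),
    // and multi-word markup like <a href="..."> would break apart.
    for (String token : s.split("\\s+")) {
      System.out.print("[" + token + "]");
    }
    System.out.println();
    // Whitespace plus punctuation split: "XJ-6" becomes [XJ][6],
    // "U.S." becomes [U][S], and "10,000" becomes [10][000].
    for (String token : s.split("[\\s\\p{Punct}]+")) {
      System.out.print("[" + token + "]");
    }
    System.out.println();
  }
}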

I next considered using the word instance of java.text.BreakIterator; I was already using the sentence instance to split up the paragraph. This yielded slightly better results, recognizing 10,000 as a single token, but that was about it.
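
The word instance usage looked something like this (a minimal sketch, not my exact code):

import java.text.BreakIterator;

public class WordInstanceDemo {
  public static void main(String[] args) {
    String s = "Expect to pay another few 10,000 dollars for the XJ-6.";
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(s);
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
      // "10,000" comes back as a single token, but "XJ-6" is still
      // split into "XJ", "-" and "6".
      System.out.println("[" + s.substring(start, end) + "]");
    }
  }
}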

I then created a custom subclass of BreakIterator, which used a combination of regular expression patterns and character collocation to do a much better job. Since I was accessing the BreakIterator through my WordTokenizer class, and the only method it called on the BreakIterator was next(), I just extended that one method. The next() method delegated to a BreakIterator.getWordInstance() reference created during the subclass's construction, something like this:

public class MyCustomBreakIterator extends BreakIterator {
  private BreakIterator wordInstance;

  public MyCustomBreakIterator() {
    this.wordInstance = BreakIterator.getWordInstance();
    ...
  }

  public int next() {
    // override behavior here, use wordInstance to figure out
    // break points to return...
    ...
  }

  ...
}

The basic idea was that at each break point, I would test the remaining string against various regular expression patterns, and if I found a match, advance my break point to the end of that match. This is similar to the approach used in the Perl code in Konchady's TextMine project. I also exploited character collocation using some ideas from Richard Gillam's article on boundary analysis. Specifically, at each break point reported by the underlying BreakIterator, I checked whether the characters preceding and following the break point were letters or digits; only then would the break point be reported upwards as a valid boundary, otherwise the BreakIterator would advance to the next position.
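
A minimal sketch of what the overridden next() might have looked like is shown below; the text, pos and patterns (java.util.regex.Pattern) fields are assumptions for illustration, not my actual code, and a real implementation would also need to re-sync wordInstance after a regex match, e.g. via following(pos):

  // Sketch only: text, pos and patterns are hypothetical fields.
  public int next() {
    // First try each custom regex pattern anchored at the current position.
    for (Pattern pattern : patterns) {
      Matcher m = pattern.matcher(text);
      if (m.find(pos) && m.start() == pos) {
        pos = m.end();   // swallow the whole match, e.g. "xj-6@jaguar.com"
        return pos;
      }
    }
    // Otherwise fall back to the break point reported by the word instance.
    int brk = wordInstance.next();
    if (brk != BreakIterator.DONE) {
      pos = brk;
    }
    return brk;
  }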

I was halfway through this approach when I realized that the code had become pretty messy, with lots of nested conditions - so much so that I was having trouble adding functionality without breaking something else. I had cursorily read the limited documentation (for the Java version) available for RuleBasedBreakIterator, but I had been staying away from it because of the learning curve involved. However, at this point, I decided to abandon the subclassing approach and give the RuleBasedBreakIterator a try.

The default RuleBasedBreakIterator did not perform much better than the BreakIterator from java.text, which was no surprise, since they are expected to function similarly. The trick was to be able to define rules that override the default behavior.

The first hurdle I faced was figuring out what rules to define. User-level documentation is sparse for ICU4j - the only thing remotely useful was the EBNF chart describing the rule file format. I tried looking at the test code in the source for examples, but found that pretty difficult to understand as well. Finally, I worked out how to dump the default rules, figuring that would give me a reasonable starting point.

RuleBasedBreakIterator rbbi = 
  (RuleBasedBreakIterator) BreakIterator.getWordInstance(Locale.getDefault());
String defaultRules = rbbi.toString();

After some trial and error, I finally figured out how to define my custom rules. Shown below are the default rules plus the custom rules I added to tokenize abbreviations, hyphenated words, emoticons, email addresses, URLs and XML (or HTML) markup. As you can see, the patterns are defined in the !!chain section, and the forward (next()) rules are defined in the !!forward section. The break point moves forward to the longest pattern that can be matched using the rules. Although the ICU4j documentation says that all four sections (forward, reverse, safe_forward and safe_reverse) must be defined, my code worked with only the forward rules customized, perhaps because sensible defaults were already present in the default rules.

If you are a Java programmer, one thing to note is that the regular expressions used in this file are not Java regular expressions, but rather expressions defined in the Single Unix Regular Expression Specification used by Perl and grep (i.e. [:punct:] rather than \p{Punct}).

!!chain;
$VoiceMarks = [\uff9e\uff9f];
$Format = [\p{Word_Break = Format}];
$Katakana = [\p{Word_Break = Katakana}-$VoiceMarks];
$ALetter = [\p{Word_Break = ALetter}];
$MidLetter = [\p{Word_Break = MidLetter}];
$MidNum = [\p{Word_Break = MidNum}];
$Numeric = [\p{Word_Break = Numeric}];
$ExtendNumLet = [\p{Word_Break = ExtendNumLet}];
$CR = \u000d;
$LF = \u000a;
$Extend = [\p{Grapheme_Cluster_Break = Extend}$VoiceMarks];
$Control = [\p{Grapheme_Cluster_Break = Control}];
$dictionary = [:LineBreak = Complex_Context:];
$ALetterPlus = [$ALetter [$dictionary-$Extend-$Control]];
$KatakanaEx = $Katakana     ($Extend |  $Format)*;
$ALetterEx = $ALetterPlus  ($Extend |  $Format)*;
$MidLetterEx = $MidLetter    ($Extend |  $Format)*;
$MidNumEx = $MidNum       ($Extend |  $Format)*;
$NumericEx = $Numeric      ($Extend |  $Format)*;
$ExtendNumLetEx = $ExtendNumLet ($Extend |  $Format)*;
$Hiragana = [:Hiragana:];
$Ideographic = [:IDEOGRAPHIC:];
$HiraganaEx = $Hiragana ($Extend |  $Format)*;
$IdeographicEx = $Ideographic  ($Extend |  $Format)*;
# ============= Custom Rules ================
# Abbreviation: uppercase letters or digits separated by periods, optionally followed by a trailing period
$Abbreviation = [A-Z0-9](\.[A-Z0-9])+(\.)*;
# Hyphenated Word : sequence of letter or digit, (punctuated by - or _, with following letter or digit sequence)+
$HyphenatedWord = [A-Za-z0-9]+([\-_][A-Za-z0-9]+)+;
# Email address: sequence of letters, digits and punctuation followed by @ and followed by another sequence
$EmailAddress = [A-Za-z0-9_\-\.]+\@[A-Za-z][A-Za-z0-9_]+\.[a-z]+;
# Internet Addresses: http://www.foo.com(/bar)
$InternetAddress = [a-z]+\:\/\/[a-z0-9]+(\.[a-z0-9]+)+(\/[a-z0-9][a-z0-9\.]+);
# XML markup: A run begins with < and ends with the first matching >
$XmlMarkup = \<[^\>]+\>; 
# Emoticon: A run that starts with :;B8{[ and contains only one or more of the following -=/{})(
$Emoticon = [B8\:\;\{\[][-=\/\{\}\)\(]+; 

!!forward;
$CR $LF  ($Extend | $Format)*;
.? ($Extend |  $Format)+;
$NumericEx {100};
$ALetterEx {200};
$KatakanaEx {300};
$HiraganaEx {300};
$IdeographicEx {400};
$ALetterEx $ALetterEx {200};
$ALetterEx $MidLetterEx $ALetterEx {200};
$NumericEx $NumericEx {100};
$ALetterEx $Format* $NumericEx {200};
$NumericEx $ALetterEx {200};
$NumericEx $MidNumEx $NumericEx {100};
$KatakanaEx $KatakanaEx {300};
$ALetterEx $ExtendNumLetEx {200};
$NumericEx $ExtendNumLetEx {100};
$KatakanaEx $ExtendNumLetEx {300};
$ExtendNumLetEx $ExtendNumLetEx{200};
$ExtendNumLetEx $ALetterEx  {200};
$ExtendNumLetEx $NumericEx  {100};
$ExtendNumLetEx $KatakanaEx {300};
# Custom : Abbreviation
$Abbreviation {500};
$HyphenatedWord {501};
$EmailAddress {502};
$InternetAddress {503};
$XmlMarkup {504};
$Emoticon {505};

!!reverse;
$BackALetterEx = ($Format | $Extend)* $ALetterPlus;
$BackNumericEx = ($Format | $Extend)* $Numeric;
$BackMidNumEx = ($Format | $Extend)* $MidNum;
$BackMidLetterEx = ($Format | $Extend)* $MidLetter;
$BackKatakanaEx = ($Format | $Extend)* $Katakana;
$BackExtendNumLetEx= ($Format | $Extend)* $ExtendNumLet;
($Format | $Extend)* $LF $CR;
($Format | $Extend)*  .?;
$BackALetterEx $BackALetterEx;
$BackALetterEx $BackMidLetterEx $BackALetterEx;
$BackNumericEx $BackNumericEx;
$BackNumericEx $BackALetterEx;
$BackALetterEx $BackNumericEx;
$BackNumericEx $BackMidNumEx $BackNumericEx;
$BackKatakanaEx $BackKatakanaEx;
($BackALetterEx | $BackNumericEx | $BackKatakanaEx | $BackExtendNumLetEx) $BackExtendNumLetEx;
$BackExtendNumLetEx ($BackALetterEx | $BackNumericEx | $BackKatakanaEx);

!!safe_reverse;
($Extend | $Format)+ .?;
$MidLetter $BackALetterEx;
$MidNum $BackNumericEx;
$dictionary $dictionary;

!!safe_forward;
($Extend | $Format)+ .?;
$MidLetterEx $ALetterEx;
$MidNumEx $NumericEx;
$dictionary $dictionary;

You will notice references to Katakana and Hiragana, and to $dictionary, even though I am parsing ASCII text. This stuff came in as part of the default rules, even though I specified Locale.getDefault(), which for me is Locale.US, so I left it in there. The $dictionary refers to dictionary-based lookup, which I am not sure it's even doing, but that's something I need to explore later.

So anyway, here is my WordTokenizer class which calls the RuleBasedBreakIterator.

package com.mycompany.myapp.tokenizers;

import java.io.File;
import java.util.Map;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class WordTokenizer {
  
  private final Log log = LogFactory.getLog(getClass());

  @SuppressWarnings("unchecked")
  private final static Map<Integer,TokenType> RULE_ENTITY_MAP = 
    ArrayUtils.toMap(new Object[][] {
      {new Integer(0), TokenType.UNKNOWN},
      {new Integer(100), TokenType.NUMBER},
      {new Integer(200), TokenType.WORD},
      {new Integer(500), TokenType.ABBREVIATION},
      {new Integer(501), TokenType.WORD},
      {new Integer(502), TokenType.INTERNET},
      {new Integer(503), TokenType.INTERNET},
      {new Integer(504), TokenType.MARKUP},
      {new Integer(505), TokenType.EMOTICON},
  });
  
  private String text;
  private int index = 0;
  private RuleBasedBreakIterator breakIterator;
  
  public WordTokenizer() throws Exception {
    super();
    this.breakIterator = new RuleBasedBreakIterator(
      FileUtils.readFileToString(
      new File("src/main/resources/word_break_rules.txt"), "UTF-8"));
  }
  
  public void setText(String text) {
    this.text = text;
    this.breakIterator.setText(text);
    this.index = 0;
  }
  
  public Token nextToken() throws Exception {
    int end = breakIterator.next();
    if (end == BreakIterator.DONE) {
      return null;
    }
    String nextWord = text.substring(index, end);
    log.debug("next=" + nextWord + "[" + breakIterator.getRuleStatus() + "]");
    index = end;
    return new Token(nextWord, 
      RULE_ENTITY_MAP.get(breakIterator.getRuleStatus()));
  }
}

The Token bean is a simple JavaBean which exposes the value and type properties via standard getters and setters. I have removed the getters and setters for brevity; any IDE can generate them for you.

package com.mycompany.myapp.tokenizers;

public class Token {

  public Token() {
    super();
  }
  
  public Token(String value, TokenType type) {
    this();
    setValue(value);
    setType(type);
  }
  
  private String value;
  private TokenType type;
  
  ... getters and setters here
 
  @Override
  public String toString() {
    return value + " (" + type + ")";
  }
}

The TokenType is an enum that lists the various entities that I want to capture at some point.

package com.mycompany.myapp.tokenizers;

public enum TokenType {
  ABBREVIATION, 
  COMBINED, 
  COLLOCATION, 
  EMOTICON, 
  INTERNET, 
  WORD,
  NUMBER, 
  WHITESPACE,
  PUNCTUATION, 
  PLACE, 
  ORGANIZATION,
  MARKUP, 
  UNKNOWN
}

The SentenceTokenizer uses the sentence instance of java.text.BreakIterator, which seems to do a fine job for the input I am feeding it. I did notice issues with dynamic URLs, where the embedded '?' character is treated as a sentence delimiter. Fortunately, I don't expect my text to have such URLs, at least not yet. If that changes, though, I could switch the SentenceTokenizer to use a RuleBasedBreakIterator as well.

package com.mycompany.myapp.tokenizers;

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceTokenizer {

  private String text;
  private int index = 0;
  private BreakIterator breakIterator;
  
  public SentenceTokenizer() {
    super();
    this.breakIterator = BreakIterator.getSentenceInstance(
      Locale.getDefault());
  }
  
  public void setText(String text) {
    this.text = text;
    this.breakIterator.setText(text);
    this.index = 0;
  }
  
  public String nextSentence() {
    int end = breakIterator.next();
    if (end == BreakIterator.DONE) {
      return null;
    }
    String sentence = text.substring(index, end);
    index = end;
    return sentence;
  }
}

My JUnit code for parsing the input text looks like this. Since creating a RuleBasedBreakIterator is expensive, we create it only once per run by instantiating the WordTokenizer once. We do the same with the SentenceTokenizer for consistency, even though there is no corresponding cost to avoid.

package com.mycompany.myapp.tokenizers;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

public class TokenizerTest {

  private final Log log = LogFactory.getLog(getClass());
  
  @Test
  public void testTokenizingForExamplePara() throws Exception {
    String paragraph = "Jaguar will sell its new XJ-6 model in the U.S. for " +
      "a small fortune :-). Expect to pay around USD 120ks. Custom options " +
      "can set you back another few 10,000 dollars. For details, go to " +
      "<a href=\"http://www.jaguar.com/sales\" alt=\"Click here\">" +
      "Jaguar Sales</a> or contact xj-6@jaguar.com.";
    SentenceTokenizer sentenceTokenizer = new SentenceTokenizer();
    WordTokenizer wordTokenizer = new WordTokenizer();
    sentenceTokenizer.setText(paragraph);
    String sentence = null;
    while ((sentence = sentenceTokenizer.nextSentence()) != null) {
      log.debug("sentence=[" + sentence + "]");
      wordTokenizer.setText(sentence);
      Token token = null;
      while ((token = wordTokenizer.nextToken()) != null) {
        log.debug("token=" + token.getValue() + " [" + token.getType() + "]");
      }
    }
  }
}

Here is the output of the test (formatted, kind of, for readability). I think this tokenization is much better, but I will let you be the judge. I do not strip out whitespace using the heuristic advised in the BreakIterator Javadocs, since I want to recognize the entities as specific tokens, and then be able to reconstruct the sentence in its original form from its component tokens.

sentence=[Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). ]
token=Jaguar [WORD]
token=  [UNKNOWN]
token=will [WORD]
token=  [UNKNOWN]
token=sell [WORD]
token=  [UNKNOWN]
token=its [WORD]
token=  [UNKNOWN]
token=new [WORD]
token=  [UNKNOWN]
token=XJ-6 [WORD]
token=  [UNKNOWN]
token=model [WORD]
token=  [UNKNOWN]
token=in [WORD]
token=  [UNKNOWN]
token=the [WORD]
token=  [UNKNOWN]
token=U.S. [ABBREVIATION]
token=  [UNKNOWN]
token=for [WORD]
token=  [UNKNOWN]
token=a [WORD]
token=  [UNKNOWN]
token=small [WORD]
token=  [UNKNOWN]
token=fortune [WORD]
token=  [UNKNOWN]
token=:-) [EMOTICON]
token=. [UNKNOWN]
token=  [UNKNOWN]
sentence=[Expect to pay around USD 120ks. ]
token=Expect [WORD]
token=  [UNKNOWN]
token=to [WORD]
token=  [UNKNOWN]
token=pay [WORD]
token=  [UNKNOWN]
token=around [WORD]
token=  [UNKNOWN]
token=USD [WORD]
token=  [UNKNOWN]
token=120ks [WORD]
token=. [UNKNOWN]
token=  [UNKNOWN]
sentence=[Custom options can set you back another few 10,000 dollars. ]
token=Custom [WORD]
token=  [UNKNOWN]
token=options [WORD]
token=  [UNKNOWN]
token=can [WORD]
token=  [UNKNOWN]
token=set [WORD]
token=  [UNKNOWN]
token=you [WORD]
token=  [UNKNOWN]
token=back [WORD]
token=  [UNKNOWN]
token=another [WORD]
token=  [UNKNOWN]
token=few [WORD]
token=  [UNKNOWN]
token=10,000 [NUMBER]
token=  [UNKNOWN]
token=dollars [WORD]
token=. [UNKNOWN]
token=  [UNKNOWN]
sentence=[For details, go to <a href="http://www.jaguar.com/sales" alt="Click here">Jaguar Sales</a> or contact xj-6@jaguar.com.]
token=For [WORD]
token=  [UNKNOWN]
token=details [WORD]
token=, [UNKNOWN]
token=  [UNKNOWN]
token=go [WORD]
token=  [UNKNOWN]
token=to [WORD]
token=  [UNKNOWN]
token=<a href="http://www.jaguar.com/sales" alt="Click here"> [MARKUP]
token=Jaguar [WORD]
token=  [UNKNOWN]
token=Sales [WORD]
token=</a> [MARKUP]
token=  [UNKNOWN]
token=or [WORD]
token=  [UNKNOWN]
token=contact [WORD]
token=  [UNKNOWN]
token=xj-6@jaguar.com [INTERNET]
token=. [UNKNOWN]
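
Since whitespace and punctuation come through as tokens in their own right, reconstructing a sentence is just a matter of concatenating token values in order, along these lines (a sketch using the wordTokenizer and sentence variables from the test above):

    // Every character of the sentence lands in exactly one token, so
    // concatenating the values restores the original sentence.
    StringBuilder rebuilt = new StringBuilder();
    wordTokenizer.setText(sentence);
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      rebuilt.append(token.getValue());
    }
    // rebuilt.toString() should now equal sentence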

A nice side effect of using the RuleBasedBreakIterator is the ruleStatus() value it returns at each break point. This is the id of the rule that matched at that break point; for example, if the current break point was generated by matching a number pattern from the previous break point, the rule status would be 100. So it is possible to assign a preliminary token type to each token returned from the RuleBasedBreakIterator, instead of deferring this wholly to the Entity Recognition phase (which I will blog about next week).
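
For example, the status can be read directly at each break point, something like this (a sketch, assuming the rule file lives at the same path used by WordTokenizer above):

import java.io.File;

import org.apache.commons.io.FileUtils;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class RuleStatusDemo {
  public static void main(String[] args) throws Exception {
    String rules = FileUtils.readFileToString(
      new File("src/main/resources/word_break_rules.txt"), "UTF-8");
    RuleBasedBreakIterator rbbi = new RuleBasedBreakIterator(rules);
    String text = "Expect to pay around 10,000 dollars.";
    rbbi.setText(text);
    int start = rbbi.first();
    for (int end = rbbi.next(); end != BreakIterator.DONE; start = end, end = rbbi.next()) {
      // getRuleStatus() returns the {nnn} tag of the rule that produced this
      // break point, e.g. 100 for "10,000" and 200 for "Expect".
      System.out.println(rbbi.getRuleStatus() + " [" + text.substring(start, end) + "]");
    }
  }
}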

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

24 comments (moderated to prevent spam):

betty said...

Hi, Sujit,
Recently I read this post, which has been a great help in my research. I imported all the classes from the post into Eclipse, but when I run TokenizerTest I get an error. The problem seems to be word_break_rules.txt - is its content just the rules file shown near the top of the post, or is there something else to pay attention to?

Look forward to your reply, thank you

Sujit Pal said...

Hi Betty, glad to hear that you found the post useful. When I wrote this post, I found the ICU4J rule file syntax quite hard to understand too. However, more recently I needed to build sentence and paragraph tokenizers, and I found some resources which helped me understand the rules better, you may find these helpful too. Look at the links under the sections "Sentence Tokenization" and "Paragraph Tokenization" in this post.

betty said...

Hi Sujit,
Recently I read your articles.
I am working on classifying WSDL documents. First, I extract some keywords. Now I want to use the tf-idf algorithm to label terms. Do you think that is feasible? Also, do you have any suggestions for classifying WSDL documents?

Looking forward to your answer.
Thanks, Betty.

Sujit Pal said...

Hi Betty, I am not sure what you mean by classifying WSDL documents (or why you would want to do this), but if you've extracted keywords, then yes, you can build a term-document matrix from them, and then do a TF*IDF normalization on it. I see you've already taken a look at my TF, IDF and LSI post; if you now want to classify them, take a look at some of my recent posts on classification here and here.

pavel.veinik said...

Hi, Sujit.

I'm creating a syntactic analyzer for Russian and other languages, and this post was extremely useful for me. I'm very glad I found it; I'm reading other posts on your blog and looking forward to future posts :)

Thank you.

Anonymous said...

Hi Sujit,

Please, how do I go about downloading your code from the SVN repository?

Sujit Pal said...

You can find the code here:
http://jtmt.sourceforge.net/
Click on the download link; I believe it gets you the tarball off SVN. Alternatively, you can point your svn client at https://jtmt.svn.sourceforge.net/svnroot/jtmt and you should be able to grab the latest from it.

Unknown said...

Hey Sujit,

I am new to Java. This code is great. How can I view the output of the JUnit test? I am not sure what happens with the apache logging.

Thanks
Chris

Sujit Pal said...

Thanks Chris. To view the logging output, you can either include a log4j.properties file in your classpath (if you are calling Java from the command line or via a script, you may need to use the -Dlog4j.configuration switch), or change the log.xxx() calls to System.out.println(). The latter approach is quicker to get working, but it will (a) buffer the output and (b) not tell you the source, so I would recommend you take a bit of time to figure out log4j with commons-logging - there are a lot of resources available on the web about this stuff.

Anonymous said...

Hello,

I am trying to leverage the rules you're using for your SentenceTokenizer and I'm finding that it's breaking on every character for me. (Apparently, it's breaking on rule 101 for every character?) There may be something wrong with my implementation (which looks a lot like yours). Did you run into this behavior at all?

Thanks,
Robert (robert dot voyer at google dot com)

Sujit Pal said...

Hi Robert, I also found ICU4J hideously hard to use and debug. The only upside (at the time) I saw was that it was very configurable. I haven't actually seen what you are seeing (I don't see a rule 101 defined in the config file, so I am not sure what it is). I haven't used this in a while, since one of my colleagues has built a custom Lucene analyzer/tokenizer which does this already (not using ICU4J, it's all home-grown code), and we just use that to tokenize.

cithinker said...

Hi Sujit,
Your posts are great and give great insight. I am trying to collect JSON feeds from social media sites and trying to normalize them and do tf-idf on some keywords. Can you help and point me to code that can clean up JSON feeds, normalize them, and find the tf-idf of some keywords that I will be trying to find in those JSON feeds?
Please advise.
Thanks,
cithinker

Sujit Pal said...

Hi Anshu, you can use a standard JSON parser (json-lib, jackson, etc.) to parse the text content of the feeds out into text files. Your other comment (which I deleted because it was a duplicate) contains references to twitter messages, which are relatively short. For TF/IDF, you may want to take a look at the methods described in Dr Konchady's book (referenced in this post) and/or use the code in my JTMT project (jtmt.sf.net). The problem with the stuff in JTMT is that it doesn't scale too well for very large data sets, since the calculations are all in memory, but depending on the size of your data it may be adequate. If not, you may want to use JTMT (or other similar projects) as a starting point for your work.

cithinker said...

Hi Sujit,
Thanks for the reply. I am trying to use the JTMT code but am getting an exception in the pom.xml itself. Could you point out what the problem could be? Thanks again for all the help.
Anshu


Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: C:\Tools\maven
Java version: 1.6.0_23, vendor: Sun Microsystems Inc.
Java home: C:\Program Files (x86)\Java\jdk1.6.0_23\jre
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 7", version: "6.1", arch: "x86", family: "windows"
[INFO] Error stacktraces are turned on.
[DEBUG] Reading global settings from C:\Tools\maven\conf\settings.xml
[DEBUG] Reading user settings from C:\Tools\maven\conf\settings.xml
[DEBUG] Using local repository at C:\Tools\maven\repository
[DEBUG] Using manager EnhancedLocalRepositoryManager with priority 10 for C:\Tools\maven\repository
[INFO] Scanning for projects...
[DEBUG] Extension realms for project net.sf:jtmt:jar:1.0-SNAPSHOT: (none)
[DEBUG] Looking up lifecyle mappings for packaging jar from ClassRealm[plexus.core, parent: null]
[ERROR] The build could not read 1 project -> [Help 1]
org.apache.maven.project.ProjectBuildingException: Some problems were encountered while processing the POMs:
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 612, column 15
[WARNING] 'build.plugins.plugin.version' for org.mortbay.jetty:maven-jetty6-plugin is missing. @ line 636, column 15
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-war-plugin is missing. @ line 620, column 15
[ERROR] 'build.plugins.plugin[org.mortbay.jetty:maven-jetty6-plugin].dependencies.dependency.scope' for org.apache.geronimo.specs:geronimo-j2ee_1.4_spec:jar must be one of [compile, runtime, system] but is 'provided'. @ line 653, column 20

at org.apache.maven.project.DefaultProjectBuilder.build(DefaultProjectBuilder.java:339)
at org.apache.maven.DefaultMaven.collectProjects(DefaultMaven.java:632)
at org.apache.maven.DefaultMaven.getProjectsForMavenReactor(DefaultMaven.java:581)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:233)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
[ERROR]
[ERROR] The project net.sf:jtmt:1.0-SNAPSHOT (C:\Projects\jtmt\pom.xml) has 1 error
[ERROR] 'build.plugins.plugin[org.mortbay.jetty:maven-jetty6-plugin].dependencies.dependency.scope' for org.apache.geronimo.specs:geronimo-j2ee_1.4_spec:jar must be one of [compile, runtime, system] but is 'provided'. @ line 653, column 20
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

cithinker said...

Hi Sujit,
Also, I wanted to ask about the JTMT scalability issues. Do you think it would be a good idea to convert the JSON feeds to POJOs using the APIs, or to traverse the plain JSON documents themselves and retrieve the text? I agree that using the clients to get cleaner text from these JSON feeds from social sites will give us a cleaner feed, but could scalability then be compromised? Please advise, and thanks for replying.
Anshu

Sujit Pal said...

Hi Anshu,

My personal preference is to use APIs if they have been provided but then of course you are governed/limited by the provider's terms of use. Scraping does not have that limitation, but you have to go through the pain of identifying the boilerplate and removing it, and it will break every time the provider decides to change his HTML. But YMMV :-).

Hadee said...

Dear Sujit
I am working on bioinformatics text mining. Would you please tell me how I can run your tokenizer in detail? I got confused after several tries.

Sujit Pal said...

Hi Hadee, while the ICU4j RBBI is very convenient, its rule file syntax is quite hard to understand. You may find some more information in the ICU4j example here. Alternatively, if you don't need behavior beyond the one provided by Java's BreakIterator you may just want to use that.

RAVI said...

Hi, this is Ravi.
My project is related to text mining, and your blog is really helping me with understanding and coding for my project.
I am getting a "TYPE MISMATCH" error in the WordTokenizer.java file at the map.
Can you please help me figure out why this is an error, even though both are Objects?

Sujit Pal said...

Hi Ravi, I don't see a MAP in my code for WordTokenizer - did you mean the RULE_ENTITY_MAP? If so, I am not sure why; you will probably have to provide more details (your stack trace would help).

melissa said...

Hi Sujit Pal,

This post is great and just what I am looking for. I've tried to implement your code, but about the RuleBasedBreakIterator: if I understand it right, it uses the rules you defined. Is there a class for it, or is it referring to something else? I can't see where the RuleBasedBreakIterator is coming from.

Anticipating your reply.
Thank you.

Sujit Pal said...

Hi Melissa, the RuleBasedBreakIterator is a class and comes from the ICU4j project jar. The file of rules is configuration for the class.

Mohammed said...

Hey Sujit,

I am trying to implement a bag of words model. I wanted to compute a measure of cosine similarity. So, as I was working through your cosine similarity example I found out that you had used classes from your previous posts. So, I am trying to understand how you changed the rules. I tried dumping them and that worked fine. But I cannot figure out how you defined custom rules. Could you please help me understand what I'm doing wrong?

Thanks and I'm looking forward to your reply.

--Mohammed

Sujit Pal said...

Hi Mohammed, it's been a while, and I found that ICU4J was a bit of overkill for my needs (and a bit of a pain to customize, as you have noted), so I abandoned it in favor of OpenNLP's word tokenizer. But to answer your question, my rules are in the blocks marked with comments in the configuration file shown in the post. The ICU4J RuleBasedBreakIterator constructor takes a reference to this configuration file. I was trying to make the sentence and word break iterators "smarter" by defining things like abbreviations and internet addresses for them, so that these are treated as single words instead of breaking across the periods. The nice thing about OpenNLP is that it uses a MaxEnt classifier, so it can be trained with examples (although I found the default English model good enough for my purposes) rather than hand-crafted regex rules.