Monday, June 13, 2011

Lemmatizers and Morphological Analyzers

I have done a *lot* of part-of-speech tagging and parsing, but I haven't quite made the jump into lemmatization...till now, that is.

I've known for a long time that I really need to add a lemmatizer to my pipeline, and now seems like a great time. I just finished up a big video demo, and my system has improved a lot in some exciting ways, so now I really need to work on robustness. And for the system to be robust, it needs a way of figuring out that "is" is related to "are" and "be".

Despite my incessant (and inefficient) desire to create my own tools (a desire which, sadly, is *not* supported by my willingness to drive a software development project to release-ready completion), I'm looking for a pre-existing tool that meets some constraints.

  • Ideally, it should be in Java, since most of our system is. Sure, there are a few parts in other languages, mostly C, but everyone in the lab has to know Java, so I know that the Java parts of our system will be maintainable in the future. The Haskell POS tagger we acquired last semester, not so much.
  • It needs to preserve the syntactic information. When I first started this, I actually believed that was typical. Many times in papers I had seen a form like be+PST representing the analysis of was. However, I've come to understand that this is not necessarily the case. For example, I tried Morphadorner first, and at least in its default settings, it does not preserve that information, producing only be, not be+PST. My understanding, after some research, is that preserving this information is typical of morphological analyzers, not lemmatizers. However, I will review both lemmatizers and morphological analyzers, since my extremely brief survey leads me to believe that there are more freely available lemmatizers than morphological analyzers.
  • It should be trainable. While I enjoy using systems that come pre-trained, so I can get started right away before I've taken the time to figure out just how our corpus needs to be formatted, any domain has its quirks that can only be adequately handled by a system trained on that domain. Most systems are likely trainable, but I plan to discard the odd rule-based system out of hand, if any exist.
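To make the lemma-versus-analysis distinction in the second constraint concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: the class name, the hand-written lookup table, and the PRS tag are hypothetical (only be+PST appears above), and it does not reproduce the output format or API of Morphadorner or any other real tool.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LemmaVsAnalysis {
    // Hypothetical toy lexicon: surface form -> {lemma, feature tag}.
    // A real system would learn or load this; the entries are illustrative only.
    private static final Map<String, String[]> LEXICON = new HashMap<>();
    static {
        LEXICON.put("is",  new String[] {"be", "PRS"});
        LEXICON.put("are", new String[] {"be", "PRS"});
        LEXICON.put("was", new String[] {"be", "PST"});
        LEXICON.put("be",  new String[] {"be", ""});
    }

    /** Lemmatizer-style output: the base form only, features discarded. */
    public static String lemmatize(String word) {
        String[] entry = LEXICON.get(word.toLowerCase(Locale.ROOT));
        return entry == null ? word : entry[0];
    }

    /** Analyzer-style output: the base form plus its feature tag, e.g. be+PST. */
    public static String analyze(String word) {
        String[] entry = LEXICON.get(word.toLowerCase(Locale.ROOT));
        if (entry == null) {
            return word; // unknown words pass through unchanged
        }
        return entry[1].isEmpty() ? entry[0] : entry[0] + "+" + entry[1];
    }

    public static void main(String[] args) {
        for (String w : new String[] {"is", "are", "was"}) {
            System.out.println(w + " -> " + lemmatize(w) + " | " + analyze(w));
        }
    }
}
```

In this sketch, lemmatize("was") returns be, which is enough to relate "was" to "is" and "are", while analyze("was") returns be+PST and so keeps the tense information that a bare lemma throws away.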
As I review systems and look for the right one, I'll prepare a post, but in the meantime... any suggestions? :D
