
Monday, June 13, 2011

CoreNLP

This Java program from Stanford University is a full-service plain-text-to-parsed-sentence utility. It takes an input phrase and finds part-of-speech tags, lemmas, sentence boundaries, and named entities, and parses the sentence. It also finds coreferents, but they appear by themselves at the bottom of the XML file as a bunch of obscure numbers. The other results were quite simple to read. The full pipeline took 11 minutes to process two short sentences, and used so much of my system that everything else slowed down massively.

If, like me, all you want to do is lemmatize, it's much faster: only 5.3 seconds for those same two sentences. Much better!

The final output is a nice XML format. So far I'm liking this for my very initial use of a lemmatizer (I can add the morphological analysis later), as long as I can figure out how to intercept the output while it's still a data structure and write it out the way I want...
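Until the in-memory route is sorted out, one fallback is to read the lemmas back out of the XML file. Here is a minimal sketch using only the JDK's DOM parser. The sample document is hand-written, and its element names (token, word, lemma, POS) are an assumption about what the CoreNLP XML output looks like, so check them against a real output file before relying on this.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CoreNlpXmlLemmas {

    // Hand-written sample; the element names are assumed, not copied from real output.
    static final String SAMPLE =
        "<root><document><sentences><sentence id=\"1\"><tokens>"
        + "<token id=\"1\"><word>These</word><lemma>these</lemma><POS>DT</POS></token>"
        + "<token id=\"2\"><word>were</word><lemma>be</lemma><POS>VBD</POS></token>"
        + "<token id=\"3\"><word>tests</word><lemma>test</lemma><POS>NNS</POS></token>"
        + "</tokens></sentence></sentences></document></root>";

    /** Returns "word/lemma" for every token element, in document order. */
    static List<String> wordLemmaPairs(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                pairs.add(childText(token, "word") + "/" + childText(token, "lemma"));
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if there is none.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        for (String pair : wordLemmaPairs(SAMPLE)) {
            System.out.println(pair);
        }
    }
}
```

This only covers the after-the-fact case: it does nothing about the cost of writing and re-reading the file, which is the reason to prefer getting at the data structure directly.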

It took a while to find the javadocs online: if you google for them, they aren't anywhere high on the list, and there isn't an obvious link to them on the Stanford page. Finally, I saw a tiny link at the bottom of the page and clicked on it, and by switching back and forth several times between those and a snippet of code on the CoreNLP page (which includes no import paths), I finally started to get my bearings and be able to use the classes. It could have been easier, though.

Once you get started with it, it's fairly straightforward to get a simple class running that does what you want it to do. Which in my case is lemmatize. Next up: can I figure out how to replace its requirement of POS-tagging before it lemmatizes, with receiving my POS tags and using them to lemmatize?

MorphAdorner

I got this up and running fairly quickly. It came in a nice jar file, and at first I was confused about the input format because it seemed that it would accept only XML, and I couldn't figure out how to structure it. Finally, when all else failed, I looked in the user manual, which told me how to run it on a plain text example. I quickly patched together a test text file featuring the sentence:

This is a test of using Morphadorner to adorn plain english (modern) texts.

ran it, and got the following output:

This    This    d       This    this    0
is      is      vbz     is      be      0
a       a       dt      a       a       0
test    test    n1      test    test    0
of      of      pp-f    of      of      0
using   using   vvg     using   use     0
Morphadorner    Morphadorner    n1      Morphadorner    morphadorner    0
to      to      pc-acp  to      to      0
adorn   adorn   vvi     adorn   adorn   0
plain   plain   j       plain   plain   0
english english n1      english english 0
(       (       (       (       (       0
modern  modern  j       modern  modern  0
)       )       )       )       )       0
texts   texts   n2      texts   text    0
.       .       .       .       .       1

Somewhere in the user manual I found an explanation, not of the fields themselves in field order, but of the sort of information that the fields might contain, and was able to match each field to its definition.
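To make that matching concrete, here is a minimal sketch that reads one line of the output above into named fields. The field names are guesses from the sample itself, not taken from the manual: column 3 clearly holds the part-of-speech tag, column 5 the lemma, and column 6 flips to 1 on the sentence-final period, while the labels for columns 1, 2, and 4 (token, spelling, standardized spelling) are assumptions. The sketch also assumes that fields are separated by whitespace and never contain whitespace themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MorphAdornerOutput {

    /** One row of the tabular output. Field names are guesses from the sample. */
    static final class Row {
        final String token;            // column 1: token as found in the text (assumed)
        final String spelling;         // column 2: spelling (assumed)
        final String pos;              // column 3: part-of-speech tag
        final String standardSpelling; // column 4: standardized spelling (assumed)
        final String lemma;            // column 5: lemma
        final boolean endOfSentence;   // column 6: 1 on the last token of a sentence

        Row(String[] f) {
            token = f[0];
            spelling = f[1];
            pos = f[2];
            standardSpelling = f[3];
            lemma = f[4];
            endOfSentence = "1".equals(f[5]);
        }
    }

    /** Parses a single whitespace-separated output line into a Row. */
    static Row parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 6) {
            throw new IllegalArgumentException(
                "Expected 6 fields but got " + fields.length + ": " + line);
        }
        return new Row(fields);
    }

    /** Parses all non-blank lines. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "is      is      vbz     is      be      0",
            "texts   texts   n2      texts   text    0",
            ".       .       .       .       .       1");
        for (Row row : parse(sample)) {
            System.out.println(row.token + " -> " + row.lemma + " (" + row.pos + ")"
                + (row.endOfSentence ? " [end of sentence]" : ""));
        }
    }
}
```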

It was then that I realized a lemmatizer is not a morphological analyzer...

Lemmatizers and Morphological Analyzers, Part 2

This post is beginning as an initial list of lemmatizers and morphological analyzers to try, and will morph over time into reviews of those lemmatizers.

Claimed lemmatizers:
  • MorphAdorner (http://morphadorner.northwestern.edu/)
  • FreeLing (http://nlp.lsi.upc.edu/freeling/)
  • NLTK (http://www.nltk.org/)
  • CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
  • GATE (http://gate.ac.uk/)
Claimed morphological analyzers:
  • ENGTWOL (http://www2.lingsoft.fi/cgi-bin/engtwol?word=was)
  • mmorph (http://aune.lpl.univ-aix.fr/projects/multext/)
  • PC-KIMMO (http://www.sil.org/pckimmo/about_pc-kimmo.html)
  • Morpha (http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html)
Maybe:
  • LingPipe (http://alias-i.com/lingpipe/)

Lemmatizers and Morphological Analyzers

I have done a *lot* of part-of-speech tagging and parsing, but I haven't quite made the jump into lemmatization...till now, that is.

I've known for a long time that I really need to add a lemmatizer to my pipeline, so I thought now is a great time. I just finished up a big video demo and my system's been improved a lot in some exciting ways, so now I really need to work on robustness. And for the system to be robust, it needs to have a way of figuring out that "is" is related to "are" and "be".

Despite my incessant (and inefficient) desire to create my own tools (a desire which, sadly, is *not* supported by my willingness to drive a software development project to release-ready completion), I'm looking for a pre-existing tool that meets some constraints.

  • Ideally, it should be in Java, since most of our system is. Sure, there are a few parts that are in other languages, mostly C, but everyone in the lab has to know Java, so I know that the Java parts of our system will be maintainable in the future. The Haskell POS tagger we acquired last semester, not so much.
  • It needs to preserve the syntactic information. When I first started this, I actually believed that was typical. I had many times in a paper seen a form like be+PST representing the lemma for was. However, I've come to understand that this is not necessarily the case. For example, I tried MorphAdorner first, and at least in its default settings, it does not preserve that information, producing only be, not be+PST. My understanding, after some research, is that preserving this information is typical of morphological analyzers, not lemmatizers. However, I will review both lemmatizers and morphological analyzers, since my extremely brief survey leads me to believe that there are more freely-available lemmatizers than morphological analyzers.
  • It should be trainable. While I enjoy using systems that come pre-trained so I can get started using them right away, before I've taken the time to figure out just how our corpus needs to be formatted, any domain has its quirks that can only be adequately handled by a system trained on that domain. Most systems are likely trainable, but I plan to discard the odd rule-based system out of hand, if any exist.
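The second constraint is easier to see in code. Below is a minimal sketch of the kind of result that keeps the syntactic information: a lemma plus feature tags, printed in the lemma+TAG notation. The tiny lexicon is hand-built and purely hypothetical, and the PRS and 3SG tags are made-up labels for illustration (only PST comes from the notation mentioned above). It is not how any of the tools under review work internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MorphAnalysis {
    final String lemma;
    final List<String> features;

    MorphAnalysis(String lemma, String... features) {
        this.lemma = lemma;
        this.features = Collections.unmodifiableList(Arrays.asList(features));
    }

    /** Renders the analysis in lemma+TAG notation, for example be+PST. */
    String notation() {
        StringBuilder sb = new StringBuilder(lemma);
        for (String feature : features) {
            sb.append('+').append(feature);
        }
        return sb.toString();
    }

    // Hypothetical hand-built lexicon; a real analyzer would derive these entries.
    static final Map<String, MorphAnalysis> LEXICON = new HashMap<>();
    static {
        LEXICON.put("is", new MorphAnalysis("be", "PRS", "3SG"));
        LEXICON.put("are", new MorphAnalysis("be", "PRS"));
        LEXICON.put("was", new MorphAnalysis("be", "PST"));
        LEXICON.put("be", new MorphAnalysis("be"));
    }

    /** True when both surface forms are in the lexicon and share a lemma. */
    static boolean sameLemma(String a, String b) {
        MorphAnalysis left = LEXICON.get(a);
        MorphAnalysis right = LEXICON.get(b);
        return left != null && right != null && left.lemma.equals(right.lemma);
    }

    public static void main(String[] args) {
        System.out.println("was -> " + LEXICON.get("was").notation());
        System.out.println("is ~ are? " + sameLemma("is", "are"));
    }
}
```

A plain lemmatizer would stop at the lemma field, which is enough to relate "is" to "are" and "be". The feature list is the part that only a morphological analyzer would fill in.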
As I review systems and look for the right one, I'll prepare a post, but in the meantime....any suggestions? :D