Monday, June 13, 2011

CoreNLP

This Java program from Stanford University is a full-service utility that takes plain text in and gives parsed sentences out. Given an input text, it finds sentence boundaries, part-of-speech tags, lemmas, and named entities, and parses each sentence. It also finds coreferents, but these appear on their own at the bottom of the XML file as a set of obscure numbers. The other results were quite simple to read. The full pipeline took 11 minutes to process two short sentences, and used so much of my system's resources that everything else slowed down massively.

If, like me, all you want to do is lemmatize, it is much faster: those same two sentences took only 5.3 seconds. Much better!
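For reference, here is a minimal sketch of the configuration for a lemma-only run. It uses only java.util.Properties; the class and method names are my own and purely illustrative. The annotator names are CoreNLP's, and the lemmatizer needs the tokenizer, sentence splitter, and POS tagger to run before it, so all four are listed.

```java
import java.util.Properties;

// Hypothetical helper class; the name is illustrative only.
class LemmaOnlyConfig {

    // Builds the properties for a pipeline that stops after lemmatization.
    // The parser, named-entity, and coreference annotators are left out,
    // since the lemmatizer does not depend on them.
    static Properties lemmaOnlyProperties() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        return props;
    }
}
```

The resulting object is what you would hand to the StanfordCoreNLP constructor; that step is left out here so the sketch stays free of external dependencies.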

The final output is a nice XML format. So far I like this for my very first use of a lemmatizer (I can add the morphological analysis later), as long as I can figure out how to intercept the results while they're still a data structure and write them out the way I want...
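Until that is sorted out, a possible workaround is to read the XML back in and pull out just the lemmas. The sketch below uses only the DOM parser that ships with Java. It assumes each token in the output is a token element containing word and lemma child elements; check those names against your own output file. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; element names ("token", "word", "lemma") are assumptions.
class LemmaExtractor {

    // Returns one "word/lemma" string per token element, in document order.
    static List<String> wordLemmaPairs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                pairs.add(childText(token, "word") + "/" + childText(token, "lemma"));
            }
            return pairs;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent().trim();
    }
}
```

This costs an extra parse of the file, so it is a stopgap rather than a substitute for getting at the in-memory results directly.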

It took a while to find the javadocs online: if you Google for them, they aren't anywhere high on the list, and there isn't an obvious link to them on the Stanford page. Finally, I saw a tiny link at the bottom of the page and clicked on it. By switching back and forth several times between the javadocs and a snippet of code on the CoreNLP page (which includes no import paths), I started to get my bearings and was able to use the classes. It could have been easier, though.

Once you get started with it, it's fairly straightforward to get a simple class running that does what you want it to do, which in my case is lemmatize. Next up: can I figure out how to replace its own POS-tagging step, which it requires before it lemmatizes, with POS tags that I supply myself?
