Tuesday, June 28, 2011

Mouse vs. keyboard vs....joystick.

My first computer was a Commodore 64, and I've never really liked the mouse. I do chuckle along with some people when I explain emacs commands, but I love 'em! (Although, a quick aside, I think that when I am using a mouse, I'd actually like one for each hand. And I love the idea that the joystick was initially intended to be a productivity tool. I have no idea why it failed where the mouse succeeded.)

Anyway, here's an interesting article, courtesy of itworld.com via slashdot.org, about 7 days without a mouse.

Penn Treebank standard training/development/test divisions

The Penn Treebank is often used in natural language processing tasks. Typically, I use 10-fold cross validation, but there is a standard division that is often used in the parsing community. Sections 02-21 are the training set, 22 is the development set, and 23 is the test set.

Using this, and the CoNLL format (1 token per line, a blank line between sentences), there are 1,088,220 lines in the training set, 41,821 in the development set, and 59,118 in the test set. That means the test set is about 5.4% of the training set. It's good to have as much training data as possible, but that makes for a really large training set. That's going to require some very memory-efficient machine learning...

Machine learning with large datasets

I use a trainable dependency parser to produce semantics of natural language expressions, but I typically use fairly small datasets in order to stay focused on the particular language phenomena of interest. I'm trying to scale my dataset up, and I'm finding there's a whole set of issues I have to deal with. In particular, of course, the larger the dataset the more memory it sucks up.

I'm trying to use the Penn Treebank, which is basically huge. MaltParser uses TiMBL; I'm using Weka (which I chose entirely because it was the easiest Java machine-learning package to use). I wonder if there's any super-memory-efficient machine learning library? Or do I just need to sample my dataset? Or run it on IU's cluster?

The basic problem is this: Natural language is hard. To do well, you need a huge dataset. However, machine learning is intense, so you need a huge computer. That means the robot needs access to a huge computer. However, if the computer is running a scheduling system....there's going to be a real-time problem.

Monday, June 27, 2011

ant buildfile

I am not too familiar with ant, but today I successfully managed to modify someone's ant buildfile to see my very own files:

The original file had this:

<path id="compile.classpath">
                <fileset dir="${lib}">
                        <include name="weka.jar"/>
                        <include name="postaipc-0.8.5.jar"/>
                        <include name="clipc-0.2.jar"/>
                </fileset>
        </path>

and I simply modified it to:

<path id="compile.classpath">
                <fileset dir="${lib}">
                        <include name="weka.jar"/>
                        <include name="postaipc-0.8.5.jar"/>
                        <include name="clipc-0.2.jar"/>
                </fileset>
                <pathelement location="/home/me/myfiles"/>
        </path>

Thursday, June 23, 2011

Using System.out.print(ln?)()

I often use System.out.println and its sister, System.err.println.

However, I'd like to be able to easily switch the type of output my program emits: not just between stdout and a file, but also between different formats, each of which would subclass BufferedWriter in order to produce the right format.

I'm not sure how to structure it, though: either I'm trying to return strings from each of my classes, or each class has to have access to the output stream, or I end up passing the output stream around a lot.

Wednesday, June 22, 2011

Proof that LaTeX is far superior to any WYSIWYG slide creator

After creating the graphics for my animation easily and with no swearing at all, as described in the previous blog post, I was faced with the challenge of putting them into OpenOffice.

I thought for a minute, and cried to myself, "God, I wish I were doing the slides in LaTeX!"

Why?

Because this would be enough to create my slides:

\documentclass{beamer}
\mode<presentation>

\usepackage{graphicx}

\newcommand{\animatestep}[1]{\begin{frame}\frametitle{Step #1}\begin{center}\includegraphics{graphic-#1-crop.pdf}
\end{center}\end{frame}}

\begin{document}

\animatestep{1}
\animatestep{2}
\animatestep{3}
\animatestep{4}
\animatestep{5}
\animatestep{6}
\animatestep{7}
\animatestep{8}
\animatestep{9}
\animatestep{10}
\animatestep{11}
\animatestep{12}

\end{document}

And best of all, having done this once and then noticed a flaw in the basic code I used to create the graphics (I forgot to turn gray on for the arc labels), I just had to correct the flaw with a quick insertion of \g into each graphic file and run my compile commands again:

> for next in 1 2 3 4 5 6 7 8 9 10 11 12; do latex graphic-$next.tex; dvipdf graphic-$next.dvi;\
 pdfcrop graphic-$next.pdf; done
> pdflatex test-animation.tex

and the whole thing is fixed. In OpenOffice, I would have had to manually replace each graphic with the corrected one.

That settled it! From now on, I'm creating a LaTeX mockup of my slides before I create the final version in OpenOffice!

Creating OpenOffice slides from a LaTeX presentation

While my preferred method of creating slides is, naturally, using (excuse me, "utilizing") LaTeX, there are times when it is required, for reasons of organizational consistency, to use OpenOffice and a pre-defined template to create slides. Typically, the slides are for a paper for which I have already spent hours creating dependency graphics. In the past, slide creation has then required yet more hours (okay, "hours" is an exaggeration, but I stand by the sentiment it expresses) of dependency-graphic creation.

This is one reason why I was so excited to finally master the creation of pdf images, which I now intend to use in my slides! We'll see how it works!

My requirements are:

1) I must be able to seamlessly animate the progress of the dependency graph. In other words, from one slide to the next, the graphics must be placed identically, so that it seems that, instead of changing slides, I have simply added a new word or arc to the initial slide, as an animation would do.

2) The graphs must visually cohere with my template.

3) Creating them must not be labor-intensive.

I think that's all. For now.


Implementation of visual coherence:

First, the easy part. The font style is Bitstream Vera Sans, which is available as a package called bera:

\usepackage{bera}

This activates the Vera fonts, but you still need to tell LaTeX to use sans serif instead of serif by default:

\renewcommand{\familydefault}{\sfdefault}

Next, I need to color my fonts 80% gray:

\usepackage{xcolor}

\color{gray!80} this is the text!

and resize it to 24pt:

\fontsize{24pt}{24pt}
\selectfont
this is the text!

So far so good.

The font carries over into the psmatrix environment, and I can use the coloration with it (though it can't span an ampersand, so I have to color each word separately).

The arcs are colored by a command, unearthed from deep within the xcolor manual (http://mirror.ctan.org/macros/latex/contrib/xcolor/xcolor.pdf):
  • \psset{linecolor=green!50}
  • \psset{linecolor=[rgb]{0.5,1,0.5}}
  • \psframebox[linecolor={[rgb]{0.5,1,0.5}}]{foo}
You should recognize that \psset{} command from my earlier tutorial on dependency graphs.

Finally, as I mentioned, the coloring does not cross ampersand boundaries, but fortunately all the arc labels are actually considered to be within the "group" comprising the last box of the matrix row, so the arc labels, if left alone, will be the color of the last word in the utterance.
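To pull the pieces together, here is a rough sketch of the relevant fragments, assembled from the snippets above (the pstricks packages are the ones from my dependency graph posts; the sizes and gray level are whatever the template demands):

% in the preamble
\usepackage{bera}                          % Bitstream Vera fonts
\renewcommand{\familydefault}{\sfdefault}  % sans serif by default
\usepackage{xcolor}
\usepackage{pstricks,pst-node}

% in the body
\psset{linecolor=gray!80}        % arcs in 80% gray
\color{gray!80}                  % text in 80% gray
\fontsize{24pt}{24pt}\selectfont
this is the text!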

I have now fully formatted my graphic like the slide I am hoping to match!

Implementation of animation:

I have done a lot of pseudo-animation in both OpenOffice and PowerPoint, and the way I typically add things is:

  1. create my finished graphic using text and arcs. (It's important that my "graphics" are never real graphics, merely lines of text adorned by arcs.)
  2. make the parts that should appear last match the background color, effectively making them invisible

This has two advantages:
  1.  I create the finished "graph"-ic, copy it, change the copy so it looks like the step before the final graphic, copy that, etc., which is basically the least labor-intensive way of creating these graphics.
  2. If the graphic is centered, simply erasing the parts that shouldn't have appeared yet would shift the remaining visible parts, so they would not be placed identically from slide to slide. Having the exact same content, just colored differently, ensures perfect placement.
So I was happy to discover that in the psmatrix environment, you can change the default arc color halfway through! This takes care of my ability to animate my graphs stepwise!
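A sketch of what that looks like (the node numbers, arm lengths, and labels here are just illustrative, and gray!80/white stand in for the foreground and background colors of the template):

\psset{arrows=<-, angle=90, linecolor=gray!80}
\begin{psmatrix}[nodesep=2pt, colsep=0.3cm]
point&to&the&table
% arcs entered before the switch are drawn in the visible gray
\ncbar[arm=20pt]{1,2}{1,1}\nbput{\small obj}
% switch the default: arcs entered after this match the background
\psset{linecolor=white}
% the label follows the text color, so hide it explicitly as well
\ncbar[arm=40pt]{1,2}{1,4}\nbput{\small\color{white} obj}
\end{psmatrix}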

So now, I have a neatly animated set of pdf graphics representing the stepwise creation of a dependency graph, which I created in text mode. (Have I mentioned how much I prefer using text to a gui? :D This proves the age-old adage: you can take the woman away from the Commodore 64, but you can't take the Commodore 64 out of the woman!)

Friday, June 17, 2011

Creating a LaTeX Stylesheet or Package

This turns out to be quite simple:

You create a preamble, complete with \newcommand{}s and \usepackage{}s and \definecolor{}s, then you move all of this preamble into a file with the extension .sty, e.g. test.sty. Thereafter, you simply \usepackage{test}.

Why should you do this?

Well, if you're like me, you use the same formatting in paper after paper. Do this to the semantics! Do this to robot names! Do this to natural language input! So if you don't have a customized package, you'll probably end up cutting and pasting from one paper to the next. Cutting and pasting is as bad in paper creation as it is in software creation!
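As a minimal sketch, a test.sty along those lines might contain nothing but the shared commands (the command names and formatting choices here are invented, not from any particular paper):

% test.sty -- shared formatting for my papers
\RequirePackage{xcolor}   % \RequirePackage is the package-file equivalent of \usepackage
\newcommand{\robotname}[1]{\textsc{#1}}   % robot names in small caps
\newcommand{\nlinput}[1]{\textit{#1}}     % natural language input in italics
\newcommand{\sem}[1]{\texttt{#1}}         % semantics in typewriter type

and then every paper just needs \usepackage{test} in its preamble.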

Creating a customized environment in LaTeX

\newenvironment{<env-name>}[<n-args>][<default>]{<begin-code>}{<end-code>}

Why create customized commands and environments in LaTeX?

LaTeX is meant to separate form (essentially, formatting) from content. It contains plenty of commands for formatting -- otherwise you couldn't format your content at all. The quickest path to formatted content is to combine your formatting and content: basically, to format as you go along. So I often see people just use the built-in formatting commands throughout their paper. (My sample is purely academic users, who are using LaTeX almost exclusively to produce papers for academic publication.) It doesn't take too much thinking to figure out that this goes against the entire basis for using LaTeX. Not only that, it's inefficient, especially for academic papers.

Papers typically need to be a very specific length, and to meet other very specific formatting criteria. As a result, the last step in producing almost any paper is making sure those criteria are met. One common result is that time is wasted going through the paper adding and removing \small{}s and other size-related commands.

Also, many papers have multiple coauthors. This can result in formatting inconsistencies. You may choose to italicize natural language input, while your coauthor puts it in a verbatim environment.

The key to using LaTeX efficiently and consistently is to create commands that describe the type of content and apply the formatting to it. This lets you format consistently and change the formatting in one place. It also effectively separates form and content. For example, in my papers, I use a lot of dialogues, essentially scripts describing what the human said to the robot and vice versa. This is how I format them generically:

\newenvironment{dialog}
{\begin{alltt}}
{\end{alltt}}


and this is how I formatted them in the case of a specific paper with specific formatting requirements:


\newenvironment{dialog}
{\vspace{-2mm}\begin{small}\begin{alltt}}
{\end{alltt}\end{small}\vspace{-2mm}}
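Either version is used in the paper exactly the same way (the dialogue content here is just an invented example):

\begin{dialog}
HUMAN: point to the table with the blue pen
ROBOT: ok, pointing to the table
\end{dialog}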

Thursday, June 16, 2011

Creating image files from pstricks code

Yesterday I explained how to create a dependency graph in LaTeX using pstricks. It is simple to include the pstricks code directly in your document, but there are two major problems with this approach:

  1. You must then use latex and dvipdf in order to compile the file, rather than just pdflatex. 
  2. It lacks modularity. This is particularly problematic when you are not completely sure what the finished graph should look like. I often create a graph, decide to try something else, comment out all or part of the graph and make my changes. This results in a lot of commented-out code, particularly if I add comments to explain what differentiates one graph from another. It would be cleaner to create a graphic with a descriptive name and include that descriptively-named graphic in a file. Then, even if I decide to try something different, I can simply create a second graphic, and my text file remains much cleaner.
The steps to creating a dependency graph (or any other figure) as a stand-alone pdf file that can then be included as a graphic are:
  1. create the graphic normally, as you would for any pdf file
  2. produce the pdf file. It will be a full-size page with excess white space.
  3. crop the excess white space.
  4. include the cropped output file in your destination file.
  5. compile your final file.
1. Create the graphic normally, as you would for any pdf file.

Follow the steps in my previous post. You'll now have a tex file. You don't want page numbers, so make sure the preamble includes the following command:

\pagestyle{empty}
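
Altogether, the stand-alone file (example.tex here) might be a sketch like this, with the package list from the previous post:

\documentclass{article}
\usepackage{pstricks,pst-node,pst-tree,pstricks-add}
\pagestyle{empty}   % no page number on the output page

\begin{document}
% the pspicture/psmatrix code for the graph goes here
\end{document}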

2. Produce the pdf file. It will be a full-size page with excess white space.

You can't use pdflatex with pstricks; instead use

latex example.tex
dvipdf example.dvi

This will produce the output example.pdf. It's a full-size page including excess white space.

3. Crop the excess white space.

The command

pdfcrop example.pdf

will create the output file example-crop.pdf.

4. Include the cropped output file in your destination file.

First, add to the preamble

\usepackage{graphicx}

Then, where you want the graphic to appear in your destination file:

\begin{center}
\includegraphics[]{example-crop.pdf}
\end{center}

5. Compile your final file. Unless something else is preventing you, you can now use pdflatex to compile the final file.

You should now have created a dependency graph as a stand-alone file, and included it in a separate pdf file! It took me a long time to get all these steps. I was stuck on trying to use dvipng, which I was not able to get to recognize the arcs in the dependency graph. So I hope this prevents someone else from having the same trouble!

Wednesday, June 15, 2011

Dependency Graphs

I have to create a lot of dependency graphs for papers, and I've tried three different methods with success:

  • the LaTeX pstricks package - This is my favorite because it allows you to type a series of commands directly into your paper, and you end up with a beautiful graph. You can create a lovely graph and view it in pdf without ever lifting your hands from the keyboard. Better yet, the more you know, the more beautiful a graph you can produce! I learned this by emulating the code of my Computational Linguistics professor, Dr. Sandra Kuebler. She knows a lot about LaTeX. Sadly, my Cognitive Science professor, Dr. Matthias Scheutz, says pstricks is a "really old package" that "no one uses anymore", requiring me to use...
  • OmniGraffle - I also love this method. It's laborious, but you can create your graphs without ever lifting your hands from the mouse, and you hardly have to know anything to produce beautiful graphs. Sadly, this costs money, and I'm a poor student, forcing me to use...
  • dia - a free, open-source alternative to OmniGraffle.
These are all good methods, but one of them requires more knowledge than the others. That's why, in this blog post, I will explain, step by step, how to create a simple dependency graph in LaTeX using pstricks.

Draw your desired dependency graph by hand. This will make entering it into the computer much easier.  

Set up your document to accept the graph by making sure you are importing the correct packages. Most likely, you'll need the following in your preamble:
    \usepackage{pstricks,pst-node,pst-tree,pstricks-add}


    Create the proper environment, which consists of two pstricks environments that you'll create inside a normal figure environment:

    \begin{figure}
    \begin{pspicture}(x,y)
    \begin{psmatrix}[nodesep=z, rowsep=q, colsep=r]
    ...
    \end{psmatrix}
    \end{pspicture}
    \end{figure}

    (x,y) are the desired width and height of the figure you'll be creating, as dimensionless units. In my experience, playing with the width when the pspicture is centered within the figure environment just offsets the centering (and has no effect on any border), so leave it at 0. Playing with the height actually changes the height of a border around the graph. If you do have a border, it is possible to make the height small enough that the border falls on the graph, rather than around it.

    z, q, and r are parameters in the form of dimensioned units, e.g. 2pt, 0.5cm, or 0.3cm (the values I used in the example I copied these instructions from).

    Rowsep is only important if you have multiple rows. Just play with these values until things are placed as you like them.

    Type in the sentence to graph as the first line in the psmatrix. Since it's a matrix, they'll be separated by ampersands rather than spaces:

    point&to&the&table&with&the&blue&pen

    Set the default values.

    Arrow direction: Decide whether you'd rather chant to yourself "2 from 1" or "1 to 2". As I'm transferring my graph, I look at an arc and chant one of the above while I enter it into the computer. If you would rather start with the node the arrow is pointing towards, your direction will be "<-". If you would rather start with the node the arrow is pointing from, your direction will be "->":

    \psset{arrows=<-}

    Angle:

    \psset{angle=90}

    Default arc height: If you want the default to be the shortest, choose 20pt. If you want it to be medium, choose 40pt.
    \psset{arm=20pt}

    Count the number of arcs you will need to enter, and begin entering them. For each arc, add the following line:

    \ncbar[]{1,}{1,}\nbput{\small }

    when you have a completed arc, the line will look something like:

    \ncbar[]{1,x}{1,y}\nbput{\small z}

    where x is the node you start with, so the one the arrow is pointing towards or from, depending on the direction you selected, y is the second node, and z is the label of the arc. Inside the square brackets will be options such as how the arrow is centered horizontally above the node, and how tall the arrow is.

    For each arc, enter the node numbers and the label.

    \ncbar[]{1,2}{1,1}\nbput{\small obj} 

    For me, this created an object arc to node 2 from node 1.

    Enter the arc heights.

    If you have multiple arcs to or from a given node, they will need to be different heights so the arrows are clearly differentiated from each other. I do this starting from the end, visually selecting in turn each node that has multiple arrows leaving or arriving at it. If your graph does not contain crossing dependencies, some of the arcs will be "inside" others:

    [insert picture]

    Make the innermost arc 20pt, and add 20pt for each enveloping arc.

    Enter this information so that the lines look something like:

    \ncbar[arm=60pt ]{1,8}{1,5}\nbput{\small obj}
    \ncbar[arm=40pt ]{1,6}{1,8}\nbput{\small obj}
    \ncbar[arm=20pt ]{1,7}{1,8}\nbput{\small obj}

    Set the horizontal offsets. The arrows will all start from and arrive at the same point on the node unless you offset some of the arrows. There's an offset for the first node listed [offsetA] and for the second node listed [offsetB].

    So in the line

    \ncbar[offsetB=-3pt]{1,8}{1,7}\nbput{\small }

    The arrow pointing to node 7 is offset by -3pt.

    For offsetA, a negative number moves it right, while for offsetB, a negative number moves it left.

    Enter the offsets like this:

    \ncbar[arm=60pt,offsetA=1.5pt]{1,5}{1,8}\nbput{\small }
    \ncbar[arm=40pt,offsetB=-3pt]{1,8}{1,6}\nbput{\small }
    \ncbar[offsetB=0pt]{1,8}{1,7}\nbput{\small }

    If you use spaces instead of commas, the words after the first space will all be listed under the graph instead of being executed in the graph.
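
    Putting the fragments above together (the arcs, arm lengths, offsets, and labels are just the example values from this post, not a complete parse, and the pspicture height is a guess to play with), the whole figure looks something like:

    \begin{figure}
    \begin{pspicture}(0,3)
    \psset{arrows=<-, angle=90, arm=20pt}
    \begin{psmatrix}[nodesep=2pt, rowsep=0.5cm, colsep=0.3cm]
    point&to&the&table&with&the&blue&pen
    \ncbar[]{1,2}{1,1}\nbput{\small obj}
    \ncbar[arm=60pt,offsetA=1.5pt]{1,5}{1,8}\nbput{\small obj}
    \ncbar[arm=40pt,offsetB=-3pt]{1,8}{1,6}\nbput{\small obj}
    \ncbar[offsetB=0pt]{1,8}{1,7}\nbput{\small obj}
    \end{psmatrix}
    \end{pspicture}
    \end{figure}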

    Compile. You won't be able to use pdflatex anymore, instead you'll have to use latex followed by dvipdf:

    latex example.tex
    dvipdf example.dvi

    Unfortunately, if you are sharing a tex file with someone, say over subversion, they may be annoyed that you have now broken pdflatex. They may be in a position to insist that you not do this (see above). How could this be fixed? It could be fixed if you could produce a graphic from a stand-alone file containing just the information above, and then insert that graphic into the document. I haven't figured out how to do that, but I'd sure like to!

Monday, June 13, 2011

Incremental processing headaches

Being in Natural Language Human-Robot Interaction (NLHRI), we do incremental processing. It increases speed by allowing processes to occur in parallel, and allows cool things like the robot looking at what you're talking about *while* you're saying it.

However, the biggest headache is that most NLP software isn't meant to be incremental; it's still mostly structured around the idea of reading in a whole document, processing the entire document, and then producing some results. Even algorithms that would be perfectly fine incrementally typically aren't programmed so they're easy to use that way. This results in a lot of needless duplication of effort when I finally get frustrated trying to make a non-incremental program behave incrementally and just write my own.

CoreNLP

This Java program from Stanford University is a full-service plain-text-to-parsed-sentence utility. It takes an input phrase and finds part-of-speech tags, lemmas, sentence boundaries, and named entities, and parses the sentence. It also finds coreferents, but they appear by themselves at the bottom of the XML file as a bunch of obscure numbers; the other results were quite simple to read. It also took 11 minutes to process two short sentences, and used so much of my system that everything slowed down massively.

If, like me, all you want to do is lemmatize, the same two sentences took only 5.3 seconds. Much better!

The final output is a nice XML format. So far I'm liking this for my very initial use of a lemmatizer (I can add the morphological analysis later), as long as I can figure out how to intercept the output while it's still a data structure and output it however I want...

It took a while to find the javadocs online: if you google for them, they aren't anywhere high on the list, and there isn't an obvious link to them on the Stanford page. Finally, I saw a tiny link at the bottom of the page and clicked on it, and by switching back and forth several times between the javadocs and a snippet of code on the CoreNLP page (which includes no import paths), I finally started to get my bearings and was able to use the classes. It could have been easier, though.

Once you get started with it, it's fairly straightforward to get a simple class running that does what you want it to do, which in my case is lemmatize. Next up: can I figure out how to replace its requirement of POS-tagging before it lemmatizes with receiving my POS tags and using them to lemmatize?

MorphAdorner

I got this up and running fairly quickly. It came in a nice jar file, and at first I was confused about the input format because it seemed that it would accept only XML, but I couldn't figure out how to structure it. Finally, when all else failed, I looked in the user manual, which told me how to run it on a plain text example. I quickly patched together a test text file featuring the sentence:

    This is a test of using Morphadorner to adorn plain english (modern) texts.

ran it, and got the following output:

    This    This    d       This    this    0
    is      is      vbz     is      be      0
    a       a       dt      a       a       0
    test    test    n1      test    test    0
    of      of      pp-f    of      of      0
    using   using   vvg     using   use     0
    Morphadorner    Morphadorner    n1      Morphadorner    morphadorner    0
    to      to      pc-acp  to      to      0
    adorn   adorn   vvi     adorn   adorn   0
    plain   plain   j       plain   plain   0
    english english n1      english english 0
    (       (       (       (       (       0
    modern  modern  j       modern  modern  0
    )       )       )       )       )       0
    texts   texts   n2      texts   text    0
    .       .       .       .       .       1

Somewhere in the user manual I found an explanation, not of the fields themselves in field order, but of the sorts of information the fields might contain, and I was able to match each field to its definition.

It was then that I realized a lemmatizer is not a morphological analyzer...

Lemmatizers and Morphological Analyzers, Part 2

This post is beginning as an initial list of lemmatizers and morphological analyzers to try, and will morph over time into reviews of those lemmatizers.

Claimed lemmatizers:
  • MorphAdorner (http://morphadorner.northwestern.edu/)
  • FreeLing (http://nlp.lsi.upc.edu/freeling/)
  • NLTK (http://www.nltk.org/)
  • CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
  • GATE (http://gate.ac.uk/)
Claimed morphological analyzers:
  • ENGTWOL (http://www2.lingsoft.fi/cgi-bin/engtwol?word=was)
  • mmorph (http://aune.lpl.univ-aix.fr/projects/multext/)
  • PC-KIMMO (http://www.sil.org/pckimmo/about_pc-kimmo.html)
  • Morpha (http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html)
Maybe:
  • LingPipe (http://alias-i.com/lingpipe/)

Lemmatizers and Morphological Analyzers

I have done a *lot* of part-of-speech tagging and parsing, but I haven't quite made the jump into lemmatization...till now, that is.

I've known for a long time that I really need to add a lemmatizer to my pipeline, so I figured now is a great time. I just finished up a big video demo, my system's been improved a lot in some exciting ways, and now I really need to work on robustness. And for the system to be robust, it needs to have a way of figuring out that "is" is related to "are" and "be".

Despite my incessant (and inefficient) desire to create my own tools (a desire which, sadly, is *not* supported by my willingness to drive a software development project to release-ready completion), I'm looking for a pre-existing tool that meets some constraints.

  • Ideally, it should be in Java, since most of our system is. Sure, there are a few parts that are in other languages, mostly C, but everyone in the lab has to know Java, so I know that the Java parts of our system will be maintainable in the future. The Haskell POS tagger we acquired last semester, not so much.
  • It needs to preserve the syntactic information. When I first started this, I actually believed that was typical: I had many times seen in a paper a form like is+PST representing the lemma for "was". However, I've come to understand that this is not necessarily the case. For example, I tried MorphAdorner first, and at least in its default settings, it does not preserve that information, producing only "be", not "be+PST". My understanding, after some research, is that preserving this information is typical of morphological analyzers, not lemmatizers. However, I will review both lemmatizers and morphological analyzers, since my extremely brief survey leads me to believe that there are more freely-available lemmatizers than morphological analyzers.
  • It should be trainable. While I enjoy using systems that come pre-trained, so I can get started using them right away before I've taken the time to figure out just how our corpus needs to be formatted, any domain has its quirks that can only be adequately handled by a system trained on that domain. This is likely true of most systems, but I plan to discard the odd rule-based system out of hand, if any exist.
As I review systems and look for the right one, I'll prepare a post, but in the meantime....any suggestions? :D