I'm extremely disorganized, and I suffer from a lot of information duplication. (I'm working on it.)
In the meantime, however, I need to get my LaTeX citations handled! So I'm using bibtool on an .aux file and a set of bib files to produce a bib file just for that paper, containing only the citations the paper requires. Once I get organized, I won't need to do this. Till then:
/home/pscherme/bin/bibtool -s -d -x all_papers.aux review.bib newbib2.bib system2.bib new.bib > all_papers.bib
Thursday, November 17, 2011
Wednesday, October 26, 2011
Insert something from the kill ring after the cursor
Another accidental discovery: typing Ctrl-u before the usual Ctrl-y leaves the cursor in front of the inserted text, so the yanked text ends up after the cursor instead of before it (the default).
Monday, August 22, 2011
Java pun
It is when I am coding that I am most likely to use the terms "argh" and "rargh". Today, while fretting over some command-line argument handling, I found I had written the following in my notes:
"Arg, what is the usual way of doing this???"
It took me a minute to figure out why I appeared to be addressing the argument...
Monday, August 1, 2011
My Least Favorite Thing About stackoverflow
...when people comment on a question with, "I don't know LaTeX, but you shouldn't do what you want to do because of (insert totally subjective opinion, such as 'tables look better without vertical lines anyway' here)".
It just makes me want to mod your comment down as "irrelevant"! And it happens all the time with the LaTeX questions!
And also, for the 2000th time: yes, I realize tex.stackexchange exists. Yes, I realize that the person who asked the question could have answered it themselves with 2 minutes on Google and a toy example to play with. But I learn a lot about things I never even thought about when LaTeX novices ask easily-googled questions on stackoverflow, so stop with the silly comments! If you can't answer the question, then don't answer it, okay!
Thursday, July 21, 2011
Emacs
The main reason I love emacs is that sometimes I will inadvertently hit a strange key combination and -- something unexpected will happen! Today I learned through this method that M-c will capitalize a word. That's a feature I don't need often, but whenever I do -- emacs will be there!
Update: I learned this while preparing to work on creating slides from a paper. A few minutes later, as I was collecting all statements of each main idea from my tex file and assimilating them into one single, complete statement for each main idea, I was about to manually capitalize a word but I remembered in time: Meta-c!
Monday, July 18, 2011
Bayes vs. Markov
I was perhaps unjustifiably surprised as I was going through the Naive Bayes classifier model to find that it looks very similar to something I'm already quite familiar with. Basically, if you start from Bayes' Theorem and go one direction (conditional probability), then make independence assumptions, you end up with the model for the NB classifier. If you go a different direction (chain rule) and then make independence assumptions, you end up with a Markov model. I'm guessing a lot of other models are quite similar too...
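To make the comparison concrete, here is a sketch of the two derivations side by side. The notation is an assumption of this sketch, not something from the models' original sources: c is a class, f_1 ... f_n are features, w_1 ... w_n is a sequence of states (or words), and the first factor of each Markov product is read as the unconditional P(w_1).

```latex
% Naive Bayes: Bayes' Theorem, then assume the features are
% conditionally independent given the class.
\[
P(c \mid f_1,\dots,f_n)
  = \frac{P(c)\,P(f_1,\dots,f_n \mid c)}{P(f_1,\dots,f_n)}
  \approx \frac{P(c)\,\prod_{i=1}^{n} P(f_i \mid c)}{P(f_1,\dots,f_n)}
\]

% First-order Markov model: chain rule, then assume each state
% depends only on the state immediately before it.
\[
P(w_1,\dots,w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\]
```

In both cases the exact identity comes first and the independence assumption is what turns it into a product of small, easily estimated terms.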
Prior, posterior
I don't know a lot about Bayesian statistics, but I'd like to understand a few terms. I often hear "prior" and "posterior" thrown around, and here's my understanding of them after a look at Wikipedia:
It seems the prior (or prior probability) is the measure of uncertainty about an event before any evidence (specific features) is taken into account.
Apparently the posterior (or again, posterior probability) is the conditional probability assigned after relevant evidence is taken into account.
So, now to construct an example that illustrates what I currently believe about these concepts: If, in a given corpus, 50% of the tokens are determiners, then the chance of selecting a token you know nothing about and finding it to be a determiner is 50%. I believe that's the prior. However, if 70% of tokens occurring after verbs are determiners, then the posterior probability is the conditional probability P(determiner|verb) ["probability of a determiner given a verb"], so 70%.
In Bayes' Theorem, which is the one part of Bayesian anything that I am, one might almost say, *too* familiar with, the prior is multiplied by the likelihood function and then normalized to obtain the posterior. So:
posterior = (likelihood * prior) / evidence
or, equivalently:
P(A|B) = P(B|A) * P(A) / P(B)
However, one confusing segment of the Wikipedia entry for a prior is:
"[The prior] of an uncertain quantity p (for example, suppose p is the proportion of voters who will vote for the politician named Smith in a future election) is the probability distribution that would express one's uncertainty about p before the "data" (for example, an opinion poll) is taken into account."
That seems to suggest that we can't take *any* data into account in order to find it. Don't we then just have to guess? Sounds like more reading may be in order...
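The determiner example can be checked with a few lines of Java. This is only a sketch: the class name, method names, and the raw counts are made up for illustration, with the counts chosen to reproduce the 50% and 70% figures above. It estimates the prior and the posterior directly from counts, then recovers the same posterior through Bayes' Theorem.

```java
class PriorPosterior {
    /** Relative frequency estimate: count / total. */
    static double relativeFrequency(long count, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return (double) count / total;
    }

    /** Bayes' Theorem: posterior = likelihood * prior / evidence. */
    static double bayes(double likelihood, double prior, double evidence) {
        return likelihood * prior / evidence;
    }

    public static void main(String[] args) {
        // Hypothetical counts, chosen to match the percentages in the post.
        long tokens = 1000;       // all tokens in the toy corpus
        long determiners = 500;   // tokens that are determiners
        long afterVerb = 200;     // tokens that directly follow a verb
        long detAfterVerb = 140;  // determiners that directly follow a verb

        double prior = relativeFrequency(determiners, tokens);
        double posterior = relativeFrequency(detAfterVerb, afterVerb);

        // The same posterior, computed from the prior via Bayes' Theorem.
        double likelihood = relativeFrequency(detAfterVerb, determiners); // P(after verb | determiner)
        double evidence = relativeFrequency(afterVerb, tokens);           // P(after verb)
        double viaBayes = bayes(likelihood, prior, evidence);

        System.out.println("P(determiner)              = " + prior);
        System.out.println("P(determiner | after verb) = " + posterior);
        System.out.println("same thing via Bayes       = " + viaBayes);
    }
}
```

Here the prior is itself estimated from corpus counts; the "data" it ignores is only the specific evidence about the token in question (what precedes it).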
Sunday, July 17, 2011
Short vs. long papers
I'm confused about short vs. long papers. I suppose a publication in a given venue is a publication in a given venue, but are long papers more prestigious than short papers?
This varies a lot. It seems like in typical conferences, it's 4 vs. 8 pages. For RANLP, it's 7 vs 8 pages, so when I had a paper accepted as a short paper, I just had to shorten it a page. For IWPT, it's 4 vs 10! A 6-page difference?
I wish I had a wide readership I could query about their thoughts on this...
Saturday, July 16, 2011
Cluster, I hate you.
Working with huge datasets as I do, I have to use my school's cluster computing environment, which is really different from any other computer with which I am familiar. For example, in order to use programs such as emacs, svn, or java, I have to enable them using SoftEnv. This has to be done either each time you log on or with a startup script. I've been doing the former since basically all I use are emacs, svn, and java, and usually not all every time I log in, but planning to do the latter eventually... (Hint: This is foreshadowing about how my laziness may have been my salvation.)
Yesterday, late at night, I was struggling to do something I had done successfully before and kept getting bizarre errors. I finally tried unsuccessfully one last time to run my script, gave up and went to bed. This morning I began to tackle the problem again. My first step was to verify the error by rerunning the script.
Um....what error? Errors, I hate you when you exist, but I hate you even more when by not existing, you make me look crazy.
When I had slept and could think clearly, I found the problem was that, since I had to add java yesterday in order to use javac, it messed up my ability to use a particular jarfile by being the wrong version. Java versions, I hate you.
The lessons I learned from this are: First, on the cluster, if I'm having problems running something I have run before, I should log on in a new shell and see if that fixes the problem. Second, if I decrease my announced memory requirements and do all my processing at night, my process won't sit in the queue as long, so I'll find out more quickly if there are any immediate problems I need to solve.
Tuesday, June 28, 2011
Mouse vs. keyboard vs....joystick.
My first computer was a Commodore 64, and I've never really liked the mouse. I do chuckle along with some people when I explain emacs commands, but I love 'em! (Although, a quick aside, I think that when I am using a mouse, I'd actually like one for each hand. And I love the idea that the joystick was initially intended to be a productivity tool. I have no idea why it failed where the mouse succeeded.)
Anyway, here's an interesting article, courtesy of itworld.com via slashdot.org, about 7 days without a mouse.
Penn treebank standard training/development/test divisions
The Penn Treebank is often used in natural language processing tasks. Typically, I use 10-fold cross validation, but there is a standard split that is often used in the parsing community. Sections 02-21 are the training set, 22 is the development set, and 23 is the test set.
Using this, and the CoNLL format (1 token per line, a blank line between sentences), there are 1,088,220 lines in the training set, 41,821 in the development set, and 59,118 in the test set. That means the test set is about 5.4% of the size of the training set. It's good to have as much training data as possible, but that makes for a really large training set. That's going to require some very memory-efficient machine learning...
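Line counts like these can be produced without loading the whole file into memory. The sketch below is one way to do it, assuming only the layout described above (one token per line, blank line between sentences); the class and method names are hypothetical and it is not the script I actually used. It streams the input and counts token lines and sentences.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class ConllCounter {
    /** Streams the input and returns {tokenLines, sentences}. */
    static long[] count(BufferedReader in) throws IOException {
        long tokens = 0;
        long sentences = 0;
        boolean inSentence = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current sentence, if there is one.
                if (inSentence) {
                    sentences++;
                    inSentence = false;
                }
            } else {
                tokens++;
                inSentence = true;
            }
        }
        // The last sentence may not be followed by a blank line.
        if (inSentence) {
            sentences++;
        }
        return new long[] {tokens, sentences};
    }

    /** Convenience wrapper for small in-memory samples. */
    static long[] countText(String text) throws IOException {
        return count(new BufferedReader(new StringReader(text)));
    }

    public static void main(String[] args) throws IOException {
        // Toy sample: two sentences, four tokens.
        long[] c = countText("The\nrobot\nmoved\n\nStop\n");
        System.out.println(c[0] + " token lines, " + c[1] + " sentences");
    }
}
```

For a real file, the same `count` method can be handed a `BufferedReader` wrapped around a `FileReader`; only one line is held in memory at a time.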
Machine learning with large datasets
I use a trainable dependency parser to produce semantics of natural language expressions, but I typically use fairly small datasets in order to stay focused on the particular language phenomena of interest. I'm trying to scale my dataset up, and I'm finding there's a whole set of issues I have to deal with. In particular, of course, the larger the dataset the more memory it sucks up.
I'm trying to use the Penn Treebank, which is basically huge. MaltParser uses TiMBL; I'm using Weka (which I chose entirely because it was the easiest Java machine-learning package to use); I wonder if there's any super-memory-efficient machine learning library? Or do I just really need to sample my dataset? Or run it on IU's cluster?
The basic problem is this: Natural language is hard. To do well, you need a huge dataset. However, machine learning is intense, so you need a huge computer. That means the robot needs access to a huge computer. However, if the computer is running a scheduling system....there's going to be a real-time problem.
Monday, June 27, 2011
ant buildfile
I am not too familiar with ant, but today I successfully managed to modify someone's ant buildfile to see my very own files:
The original file had this:
<path id="compile.classpath">
  <fileset dir="${lib}">
    <include name="weka.jar"/>
    <include name="postaipc-0.8.5.jar"/>
    <include name="clipc-0.2.jar"/>
  </fileset>
</path>
and I simply modified it to:
<path id="compile.classpath">
  <fileset dir="${lib}">
    <include name="weka.jar"/>
    <include name="postaipc-0.8.5.jar"/>
    <include name="clipc-0.2.jar"/>
  </fileset>
  <pathelement location="/home/me/myfiles"/>
</path>
Thursday, June 23, 2011
Using System.out.print(ln?)()
I often use System.out.println and its sister, System.err.println.
However, I'd like to be able to easily switch the type of output my program emits, between not just stdout and a file, but also different formats, each of which would subclass BufferedWriter in order to produce the right format.
I don't know, though, because then I'm either trying to return strings from each of my classes, or each class has to have access to the output stream, or I end up passing the output stream around a lot.
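One possible way out, sketched below, is to keep the destination and the format in a single small object and hand that object to whoever needs to emit output, instead of subclassing BufferedWriter. Everything here (ResultFormat, TabFormat, XmlFormat, Reporter, OutputDemo) is a hypothetical name invented for the sketch, and the XML variant does no escaping, so it is an illustration rather than a finished design.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

class OutputDemo {
    /** Renders one sample result into a string using the given format. */
    static String render(ResultFormat format) {
        StringWriter buffer = new StringWriter();
        Reporter reporter = new Reporter(buffer, format);
        reporter.report("robot", "NN");
        return buffer.toString().trim();
    }

    public static void main(String[] args) {
        // Same call, different destination and format.
        new Reporter(new PrintWriter(System.out), new TabFormat()).report("robot", "NN");
        System.out.println(render(new XmlFormat()));
    }
}

/** Turns one (token, label) result into one line of output. */
interface ResultFormat {
    String format(String token, String label);
}

class TabFormat implements ResultFormat {
    public String format(String token, String label) {
        return token + "\t" + label;
    }
}

class XmlFormat implements ResultFormat {
    public String format(String token, String label) {
        // No XML escaping here: fine for a sketch, not for real data.
        return "<token label=\"" + label + "\">" + token + "</token>";
    }
}

/** Owns the destination and the format, so other classes never touch System.out. */
class Reporter {
    private final PrintWriter out;
    private final ResultFormat format;

    Reporter(Writer destination, ResultFormat format) {
        this.out = new PrintWriter(destination, true); // flush on every println
        this.format = format;
    }

    void report(String token, String label) {
        out.println(format.format(token, label));
    }
}
```

The output stream still gets passed around, but only once, at construction time, and switching between stdout, a file, or a different format becomes a one-line change where the Reporter is created.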
Wednesday, June 22, 2011
Proof that LaTex is far superior to any WYSIWYG slide creator
After creating the graphics for my animation, easily and with no swearing at all, as described in the previous blog post, I was faced with the challenge of putting them into OpenOffice.
I thought for a minute, and cried to myself, "God I wish I were doing the slides in LaTeX!"
Why?
Because this would be enough to create my slides:
\documentclass{beamer}
\mode<presentation>
\usepackage{graphicx}
\newcommand{\animatestep}[1]{\begin{frame}\frametitle{Step #1}\begin{center}\includegraphics{graphic-#1-crop.pdf}
\end{center}\end{frame}}
\begin{document}
\animatestep{1}
\animatestep{2}
\animatestep{3}
\animatestep{4}
\animatestep{5}
\animatestep{6}
\animatestep{7}
\animatestep{8}
\animatestep{9}
\animatestep{10}
\animatestep{11}
\animatestep{12}
\end{document}
And best of all, having once done this and noticed a flaw in the basic code I used to create this (I forgot to turn gray on for the arc labels), I just had to correct the flaw with a quick insertion of \g into each slide and run my compile command again:
> for next in 1 2 3 4 5 6 7 8 9 10 11 12; do latex graphic-$next.tex; dvipdf graphic-$next.dvi;\
pdfcrop graphic-$next.pdf; done
> pdflatex test-animation.tex
and the whole thing is fixed. In OpenOffice, I would have had to manually replace each graphic with the corrected one.
That decided me! From now on, I'm creating a LaTeX mockup of my slides before I create the final version in OpenOffice!
Creating OpenOffice slides from a LaTeX presentation
While my preferred method of creating slides is, naturally, using (excuse me, "utilizing") LaTeX, there are times when it is required, for reasons of organizational consistency, to use OpenOffice and a pre-defined template to create slides. Typically, the slides are for a paper for which I have already spent hours creating dependency graphics. In the past, slide-creation time has then required yet more hours (okay, "hours" is an exaggeration, but I stand by the sentiment it expresses) of dependency-graphic creation.
This is one reason why I was so excited to finally master the creation of pdf images, which I now intend to use in my slides! We'll see how it works!
My requirements are:
1) I must be able to seamlessly animate the progress of the dependency graph. In other words, from one slide to the next, the graphics must be placed identically so that it seems that, instead of changing slides, I have simply added a new word or arc to the initial slide as an animation would do.
2) The graphs must visually cohere with my template.
3) Creating them must not be labor-intensive.
I think that's all. For now.
Implementation of visual coherence:
First, the easy part. The font style is Bitstream Vera Sans, which has been implemented as a package called bera:
\usepackage{bera}
This activates the vera fonts, but now, you need to tell it to use sans serif instead of serif:
\renewcommand{\familydefault}{\sfdefault}
Next I need to color my fonts 80% gray:
\usepackage{xcolor}
\color{gray!80} this is the text!
and resize it to 24pt:
\fontsize{24pt}{24pt}
\selectfont
this is the text!
So far so good.
The font carries over into the psmatrix environment, and I can use the coloration with it (though it can't span an ampersand, so I have to color each word separately).
The arcs are colored by a command, unearthed from deep within the xcolor manual (http://mirror.ctan.org/macros/latex/contrib/xcolor/xcolor.pdf):
- \psset{linecolor=green!50}
- \psset{linecolor=[rgb]{0.5,1,0.5}}
- \psframebox[linecolor={[rgb]{0.5,1,0.5}}]{foo}
Finally, as I mentioned, the coloring does not cross ampersand boundaries, but fortunately all the arc labels are actually considered to be within the "group" comprising the last box of the matrix row, so the arc labels, if left alone, will be the color of the last word in the utterance.
I have now fully formatted my graphic like the slide I am hoping to match!
Implementation of animation:
I have done a lot of pseudo-animation in both OpenOffice and PowerPoint, and the way I typically add things is:
- create my finished graphic using text and arcs. (It's important that my "graphics" are never real graphics, merely lines of text adorned by arcs.)
- make the parts that should appear last match the background color, effectively making them invisible
This has two advantages:
- I create the finished "graph"-ic, copy it, change the copy so it looks like the step before the final graphic, copy this, etc, which is basically the least labor-intensive way of creating these graphics
- If the graphic is centered, simply erasing the parts that shouldn't have appeared yet shifts the remaining parts, so the visible graphics are no longer placed correctly. Having the exact same content, just colored differently, ensures a perfect placement.
So now, I have a neatly animated set of pdf graphics representing the stepwise creation of a dependency graph, which I created in text mode. (Have I mentioned how much I prefer using text to a gui? :D This proves the age-old adage: you can take the woman away from the Commodore 64, but you can't take the Commodore 64 out of the woman!)
Friday, June 17, 2011
Creating a LaTex Stylesheet or Package
This turns out to be quite simple:
You create a preamble, complete with \newcommands and \usepackages and \definecolors, then you move all of this preamble into a file with the extension .sty, e.g. test.sty. Thereafter, you simply \usepackage{test}.
Why should you do this?
Well, if you're like me, you use the same formatting in paper after paper. Do this to the semantics! Do this to robot names! Do this to natural language input! So if you don't have a customized package, you'll probably end up cutting and pasting from one paper to the next. Cutting and pasting is as bad in paper creation as it is in software creation!
Why create customized commands and environments in LaTex?
LaTeX is meant to be used to separate form (essentially, formatting) from content. It contains plenty of commands for formatting -- otherwise you couldn't format your content at all. The quickest path to formatted content is to combine your formatting and content: basically, to format as you go along. So I often see people just use the provided formatting commands throughout their paper. (My sample is purely academic users, who are using LaTeX almost exclusively to produce papers for academic publication.) It doesn't take too much thinking to figure out that this goes against the entire basis for using LaTeX. Not only that, it's inefficient, especially for academic papers.
Papers typically need to be a very specific length, and meet other very specific formatting criteria. As a result, the last step to producing almost any paper is to make sure those criteria are met. One common result is that some time is wasted going through the paper adding and removing \small and other size-related commands.
Also, many papers have multiple coauthors. This can result in formatting inconsistencies. You may choose to italicize natural language input, while your coauthor puts it in a verbatim environment.
The key to using LaTeX efficiently and consistently is to create commands that describe the type of content, and let those commands apply the formatting. This allows you to change the formatting consistently by editing it in one place. It also effectively separates form and content. For example, in my papers, I use a lot of dialogues, essentially scripts describing what the human said to the robot and vice versa. This is how I format them generically:
\newenvironment{dialog}
{\begin{alltt}}
{\end{alltt}}
and this is how I formatted them in the case of a specific paper with specific formatting requirements:
\newenvironment{dialog}
{\vspace{-2mm}\begin{small}\begin{alltt}}
{\end{alltt}\end{small}\vspace{-2mm}}
Thursday, June 16, 2011
Creating image files from pstricks code
Yesterday I explained how to create a dependency graph in LaTeX using pstricks. It is simple to include the code in your pdf file directly, but there are two major problems with this approach:
- You must then use latex and dvipdf in order to compile the file, rather than just pdflatex.
- It lacks modularity. This is particularly problematic when you are not completely sure what the finished graph should look like. I often create a graph, decide to try something else, comment out all or part of the graph and make my changes. This results in a lot of commented-out code, particularly if I add comments to explain what differentiates one graph from another. It would be cleaner to create a graphic with a descriptive name and include that descriptively-named graphic in a file. Then, even if I decide to try something different, I can simply create a second graphic, and my text file remains much cleaner.
The fix is to build the graph as a stand-alone graphic and include it like any other image:
- create the graphic normally, as you would in any LaTeX file
- produce the pdf file, which will be a full-size page with excess white space
- crop the excess white space
- include the cropped output file in your destination file
1. Create the graphic.
Follow the steps in my previous post. You'll now have a tex file. You don't want page numbers, so make sure the preamble includes the following command:
\pagestyle{empty}
2. Produce the pdf file.
You can't use pdflatex with pstricks; instead use
latex example.tex
dvipdf example.dvi
This will produce the output example.pdf. It's a full-size page including excess white space.
3. Crop the excess white space.
The command
pdfcrop example.pdf
will create the output file example-crop.pdf.
4. Include the cropped output file in your destination file.
First, add to the preamble
\usepackage{graphicx}
Then, at the point where you want the graph to appear, add
\begin{center}
\includegraphics[]{example-crop.pdf}
\end{center}
5. Compile your final file. Unless something else is preventing you, you can now use pdflatex to compile the final file.
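As a rough sketch of what the destination file might look like, assuming it is an ordinary article and that example-crop.pdf sits in the same directory (the file name paper.tex and the filler sentence are placeholders of mine, not part of the steps above):

```latex
% paper.tex -- hypothetical destination file; compile with: pdflatex paper.tex
\documentclass{article}
\usepackage{graphicx}   % provides \includegraphics

\begin{document}

Some text introducing the dependency graph.

% The cropped graph is an ordinary pdf image at this point,
% so this file does not need to load pstricks at all.
\begin{center}
\includegraphics{example-crop.pdf}
\end{center}

\end{document}
```

Because this file never loads pstricks, nothing in it should stand in the way of pdflatex.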
You should now have created a dependency graph as a stand-alone file, and included it in a separate pdf file! It took me a long time to work out all these steps. I was stuck on trying to use dvipng, which I could not get to recognize the arcs in the dependency graph. So I hope this saves someone else from the same trouble!
Wednesday, June 15, 2011
Dependency Graphs
I have to create a lot of dependency graphs for papers, and I've tried three different methods with success:
- the LaTeX pstricks package - This is my favorite because it allows you to type a series of commands directly into your paper, and you end up with a beautiful graph. You can create a lovely graph and view it in pdf without ever lifting your hands from the keyboard. Better yet, the more you know, the more beautiful a graph you can produce! I learned this by emulating the code of my Computational Linguistics professor, Dr. Sandra Kuebler. She knows a lot about LaTeX. Sadly, my Cognitive Science professor, Dr. Matthias Scheutz, says pstricks is a "really old package" that "no one uses anymore", requiring me to use...
- OmniGraffle - I also love this method. It's laborious, but you can create your graphs without ever lifting your hands from the mouse, and you hardly have to know anything to produce beautiful graphs. Sadly, this costs money, and I'm a poor student, so I have to use...
- dia - the free open-source version of OmniGraffle.
Draw your desired dependency graph by hand. This will make entering it into the computer much easier.
Set up your document to accept the graph by making sure you are importing the correct packages. Most likely, you'll need the following in your preamble:
\usepackage{pstricks,pst-node,pst-tree,pstricks-add}
Create the proper environment, which consists of two pstricks environments that you'll create inside a normal figure environment:
\begin{figure}
\begin{pspicture}(x,y)
\begin{psmatrix}[nodesep=z, rowsep=q, colsep=r]
...
\end{psmatrix}
\end{pspicture}
\end{figure}
(x,y) are the desired width and height of the picture you'll be creating, given as plain numbers without units. When the pspicture is centered within the figure environment, changing the width shifts the graph off center, so leave it at 0; it has no effect on any border. Changing the height changes the height of the border around the graph. If you do have a border, it is possible to make the height small enough that the border falls on the graph, rather than around it.
z, q, r are lengths with units, e.g. 2pt, 0.5cm, and 0.3cm [the values I used in the example I copied these instructions from].
Rowsep is only important if you have multiple rows. Just play with these values until things are placed as you like them.
Type in the sentence to graph as the first line in the psmatrix. Since it's a matrix, the words are separated by ampersands rather than spaces:
point&to&the&table&with&the&blue&pen
Set the default values.
Arrow direction: Decide whether you'd rather chant to yourself "2 from 1" or "1 to 2". As I'm transferring my graph, I look at an arc and chant one of the above while I enter it into the computer. If you would rather start with the node the arrow is pointing towards, your direction will be "<-". If you would rather start with the node the arrow is pointing from, your direction will be "->":
\psset{arrows=<-}
Angle:
\psset{angle=90}
Default arc height: If you want the default to be the shortest, choose 20pt. If you want it to be medium, choose 40pt.
\psset{arm=20pt}
Count the number of arcs you will need to enter, and begin entering them. For each arc, add the following line:
\ncbar[]{1,}{1,}\nbput{\small }
When you have a completed arc, the line will look something like:
\ncbar[]{1,x}{1,y}\nbput{\small z}
where x is the node you start with (the one the arrow is pointing towards or from, depending on the direction you selected), y is the second node, and z is the label of the arc. The square brackets hold options such as how the arrow is positioned horizontally above the node, and how tall the arc is.
For each arc, enter the node numbers and the label.
\ncbar[]{1,2}{1,1}\nbput{\small obj}
For me, this created an object arc to node 2 from node 1.
Enter the arc heights.
If you have multiple arcs to or from a given node, they will need to be different heights so the arrows are clearly differentiated from each other. I do this starting from the end, visually selecting in turn each node that has multiple arrows leaving or arriving at it. If your graph does not contain crossing dependencies, some of the arcs will be "inside" others:
[insert picture]
Make the innermost arc 20pt, and add 20pt for each enveloping arc.
Enter this information so that the lines look something like:
\ncbar[arm=60pt ]{1,8}{1,5}\nbput{\small obj}
\ncbar[arm=40pt ]{1,6}{1,8}\nbput{\small obj}
\ncbar[arm=20pt ]{1,7}{1,8}\nbput{\small obj}
Set the horizontal offsets. The arrows will all start from and arrive at the same point on the node unless you offset some of the arrows. There's an offset for the first node listed [offsetA] and one for the second node listed [offsetB].
So in the line
\ncbar[offsetB=-3pt]{1,8}{1,7}\nbput{\small }
the end of the arc at node 7, the second node listed, is offset by -3pt.
For offsetA, a negative number moves it right, while for offsetB, a negative number moves it left.
Enter the offsets like this:
\ncbar[arm=60pt,offsetA=1.5pt]{1,5}{1,8}\nbput{\small }
\ncbar[arm=40pt,offsetB=-3pt]{1,8}{1,6}\nbput{\small }
\ncbar[offsetB=0pt]{1,8}{1,7}\nbput{\small }
Separate the options inside the square brackets with commas. If you use spaces instead, everything after the first space is printed under the graph instead of being applied to the arc.
Compile. You won't be able to use pdflatex anymore; instead, you'll have to use latex followed by dvipdf:
latex example.tex
dvipdf example.dvi
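For reference, here is a sketch of how a complete example.tex might look when the steps above are assembled. The sentence, the defaults, and the arcs are copied from the steps; the separations are the example values mentioned earlier, and the pspicture height of 3 is purely my guess at leaving room for the tallest arc. The labels are all "obj" only because that is what the lines above use, so treat this as a starting point to adjust rather than a finished graph.

```latex
% example.tex -- compile with: latex example.tex, then dvipdf example.dvi
\documentclass{article}
\usepackage{pstricks,pst-node,pst-tree,pstricks-add}

\begin{document}

\begin{figure}
% Width 0 as suggested above; the height 3 is an assumed value.
\begin{pspicture}(0,3)
\begin{psmatrix}[nodesep=2pt, rowsep=0.5cm, colsep=0.3cm]
% The sentence: one word per column, all in row 1.
point&to&the&table&with&the&blue&pen
% Defaults: arrowhead at the first node listed, arcs rising
% from the top of each node, and the shortest arc height.
\psset{arrows=<-}
\psset{angle=90}
\psset{arm=20pt}
% One \ncbar per arc: {row,column} of the first node, then of the second.
\ncbar[]{1,2}{1,1}\nbput{\small obj}
% Nested arcs: innermost at 20pt, plus 20pt for each enveloping arc.
\ncbar[arm=60pt]{1,8}{1,5}\nbput{\small obj}
\ncbar[arm=40pt]{1,6}{1,8}\nbput{\small obj}
\ncbar[arm=20pt]{1,7}{1,8}\nbput{\small obj}
\end{psmatrix}
\end{pspicture}
\end{figure}

\end{document}
```

Three arcs meet at node 8 here, so this is where the offsetA and offsetB options described above would go.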
Unfortunately, if you are sharing a tex file with someone, say over subversion, they may be annoyed that you have now broken pdflatex. They may be in a position to insist you do not do this (see above). How could this be fixed? One way would be to produce a graphic from a file containing just the code above, and then insert that graphic into the document. I haven't figured out how to do it, but I'd sure like to!
Monday, June 13, 2011
Incremental processing headaches
Being in Natural Language Human-Robot Interaction (NLHRI), we do incremental processing. It increases speed by allowing processes to occur in parallel, and allows cool things like the robot looking at what you're talking about *while* you're saying it.
However, the biggest headache is that most NLP software isn't meant to be incremental; it's still mostly structured around the idea of reading in a whole document, processing the entire document, and then producing some results. Even algorithms that would work perfectly well incrementally typically aren't programmed so they're easy to use that way. This results in a lot of needless duplication of effort when I finally get frustrated attempting to make a non-incremental program incremental and just write my own.
CoreNLP
This Java program from Stanford University is a full-service plain-text-to-parsed-sentence utility. It takes an input phrase and finds part-of-speech tags, lemmas, sentence boundaries, and named entities, and parses the sentence. It also finds coreferents, but they appear by themselves at the bottom of the XML file as a bunch of obscure numbers. The other results were quite simple to read. The full pipeline took 11 minutes to process two short sentences, and used so much of my system that everything slowed down massively.
If, like me, all you want to do is lemmatize, it took only 5.3 seconds for those same two sentences. Much better!
Final output is a nice XML format. So far I'm liking this for my very initial use of a lemmatizer (I can add the morphological analysis later), as long as I can figure out how to intercept the output while it's still a data structure and output it how I want to output it...
It took a while to find the javadocs online: if you google for them, they aren't anywhere high on the list, and there isn't an obvious link to them on the Stanford page. Finally, I saw a tiny link at the bottom of the page and clicked on it, and by switching back and forth several times between those and a snippet of code on the CoreNLP page (which includes no import paths), I finally started to get my bearings and be able to use the classes. It could have been easier, though.
Once you get started with it, it's fairly straightforward to get a simple class running that does what you want it to do. Which in my case is lemmatize. Next up: can I figure out how to replace its requirement of POS-tagging before it lemmatizes, with receiving my POS tags and using them to lemmatize?
MorphAdorner
I got this up and running fairly quickly. It came in a nice jar file, and at first I was confused about the input format because it seemed that it would accept only XML format, but I couldn't figure out how to structure it. Finally, all else failed and I looked in the user manual, which told me how to run it on a plain text example. I quickly patched together a test text file featuring the sentence:
This is a test of using Morphadorner to adorn plain english (modern) texts.
ran it, and got the following output:
This This d This this 0
is is vbz is be 0
a a dt a a 0
test test n1 test test 0
of of pp-f of of 0
using using vvg using use 0
Morphadorner Morphadorner n1 Morphadorner morphadorner 0
to to pc-acp to to 0
adorn adorn vvi adorn adorn 0
plain plain j plain plain 0
english english n1 english english 0
( ( ( ( ( 0
modern modern j modern modern 0
) ) ) ) ) 0
texts texts n2 texts text 0
. . . . . 1
Somewhere in the user manual I found an explanation, not of the fields themselves in field order, but the sort of information that the fields might contain, and was able to match each field to the matching definition.
It was then that I realized a lemmatizer is not a morphological analyzer...
Lemmatizers and Morphological Analyzers, Part 2
This post is beginning as an initial list of lemmatizers and morphological analyzers to try, and will morph over time into reviews of those lemmatizers.
Claimed lemmatizers:
- MorphAdorner (http://morphadorner.northwestern.edu/)
- FreeLing (http://nlp.lsi.upc.edu/freeling/)
- NLTK (http://www.nltk.org/)
- CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml)
- GATE (http://gate.ac.uk/)
- ENGTWOL (http://www2.lingsoft.fi/cgi-bin/engtwol?word=was)
- mmorph (http://aune.lpl.univ-aix.fr/projects/multext/)
- PC-KIMMO (http://www.sil.org/pckimmo/about_pc-kimmo.html)
- Morpha (http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html)
- LingPipe (http://alias-i.com/lingpipe/)
Lemmatizers and Morphological Analyzers
I have done a *lot* of part-of-speech tagging and parsing, but I haven't quite made the jump into lemmatization...till now, that is.
I've known for a long time that I really need to add a lemmatizer to my pipeline, so I thought now is a great time. I just finished up a big video demo and my system's been improved a lot in some exciting ways, so now I really need to work on robustness. And for the system to be robust, it needs to have a way of figuring out that "is" is related to "are" and "be".
Despite my incessant (and inefficient) desire to create my own tools (a desire which, sadly, is *not* supported by my willingness to drive a software development project to release-ready completion), I'm looking for a pre-existing tool that meets some constraints.
- Ideally, it should be in Java, since most of our system is. Sure, there are a few parts that are in other languages, mostly C, but everyone in the lab has to know Java, so I know that the Java parts of our system will be maintainable in the future. The Haskell POS tagger we acquired last semester, not so much.
- It needs to preserve the syntactic information. When I first started this, I actually believed that was typical. I had many times seen a form like be+PST in a paper, representing the lemma for was. However, I've come to understand that this is not necessarily the case. For example, I tried Morphadorner first, and at least in its default settings, it does not preserve that information, producing only be, not be+PST. My understanding, after some research, is that preserving this information is typical of morphological analyzers, not lemmatizers. However, I will review both lemmatizers and morphological analyzers, since my extremely brief survey leads me to believe that there are more freely-available lemmatizers than morphological analyzers.
- It should be trainable. While I enjoy using systems that come pre-trained so I can get started using them right away before I've taken the time to figure out just how our corpus needs to be formatted, any domain has its quirks that can only be adequately handled by a system trained on that domain. This is likely true of most systems, but I plan to discard the odd rule-based system out-of-hand, if any exist.