Tuesday, June 28, 2011

Machine learning with large datasets

I use a trainable dependency parser to produce semantic representations of natural language expressions, but I typically use fairly small datasets in order to stay focused on the particular language phenomena of interest. I'm trying to scale my dataset up, and I'm finding there's a whole set of issues to deal with. In particular, of course, the larger the dataset, the more memory it sucks up.

I'm trying to use the Penn Treebank, which is basically huge. MaltParser uses TiMBL; I'm using Weka (which I chose entirely because it was the easiest Java machine-learning package to use). Is there a super-memory-efficient machine learning library out there? Or do I just need to sample my dataset (a rough sketch of that option is below)? Or run it on IU's cluster?
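If sampling does turn out to be the answer, one way to do it without ever loading the whole training set into memory is reservoir sampling over the raw instance file. This is just a minimal sketch, not anything from Weka or MaltParser; the file name and sample size are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {
    // Keep a uniform random sample of k lines from a file, in a single pass,
    // without ever holding the whole file in memory (classic reservoir sampling).
    public static List<String> sample(String path, int k) throws Exception {
        List<String> reservoir = new ArrayList<>(k);
        Random rng = new Random(42);
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            long seen = 0;
            while ((line = in.readLine()) != null) {
                seen++;
                if (reservoir.size() < k) {
                    // Fill the reservoir with the first k instances.
                    reservoir.add(line);
                } else {
                    // Replace a random slot with probability k / seen.
                    long j = (long) (rng.nextDouble() * seen);
                    if (j < k) {
                        reservoir.set((int) j, line);
                    }
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file of one training instance per line.
        List<String> kept = sample("training-instances.txt", 50000);
        System.out.println("Sampled " + kept.size() + " instances");
    }
}
```

The appeal is that it makes one pass over the file and keeps exactly k instances, uniformly at random, no matter how big the original dataset is, so only the sample ever has to fit in memory.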

The basic problem is this: natural language is hard, and to do well you need a huge dataset. But machine learning on a dataset that size is computationally intensive, so you need a huge computer. That means the robot needs access to a huge computer. And if that computer is running a scheduling system... there's going to be a real-time problem.
