Tuesday, June 28, 2011

Penn Treebank standard training/development/test divisions

The Penn Treebank is often used in natural language processing tasks. Typically, I use 10-fold cross-validation, but the parsing community has a standard split of the WSJ sections that is often used instead. Sections 02-21 are the training set, section 22 is the development set (section 24 is sometimes used for this as well), and section 23 is the test set.

Using this split and the CoNLL format (one token per line, with a blank line between sentences), there are 1,088,220 lines in the training set, 41,821 in the development set, and 59,118 in the test set. That means the test set is about 5.4% of the size of the training set. It's good to have as much training data as possible, but over a million lines makes for a really large training set. That's going to require some very memory-efficient machine learning...
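As a rough illustration of how counts like these can be gathered without loading a whole file into memory, here is a minimal Python sketch. It assumes only what is described above (one token per line, a blank line between sentences). The function names and the tiny inline sample are made up for illustration and are not from any treebank tooling.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one sentence at a time as a list of token lines.

    Only the current sentence is held in memory, so this works on
    files of any size.
    """
    sentence: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            # A blank line closes the current sentence.
            yield sentence
            sentence = []
    if sentence:
        # The file may not end with a blank line.
        yield sentence


def count_tokens_and_sentences(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (token_count, sentence_count) for a CoNLL-style stream."""
    tokens = sentences = 0
    for sentence in iter_sentences(lines):
        sentences += 1
        tokens += len(sentence)
    return tokens, sentences


def relative_size(part_lines: int, train_lines: int) -> float:
    """Size of one division as a percentage of the training set."""
    return 100.0 * part_lines / train_lines


# Made-up two-sentence sample standing in for a real file.
sample = io.StringIO("The\tDT\ncat\tNN\n\nIt\tPRP\nsleeps\tVBZ\n.\t.\n\n")
print(count_tokens_and_sentences(sample))        # (5, 2)
print(round(relative_size(59118, 1088220), 1))   # 5.4
```

For a real file, an open file handle can be passed in place of the sample, since iterating over a file object in Python also reads one line at a time.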
