Limitations of co-training for natural language learning from large datasets. David Pierce and Claire Cardie. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP-2001).

Co-training is a weakly supervised learning paradigm in which the redundancy of the learning task is captured by training two classifiers using separate views of the same data. This enables bootstrapping from a small set of labeled training data via a large set of unlabeled data. This study examines the learning behavior of co-training on natural language processing tasks that typically require large numbers of training instances to achieve usable performance levels. Using base noun phrase bracketing as a case study, we find that co-training reduces the difference in error between co-trained classifiers and fully supervised classifiers trained on a labeled version of all available data by 36%. However, degradation in the quality of the bootstrapped data emerges as an obstacle to further improvement. To address this, we propose a moderately supervised variant of co-training in which a human corrects the mistakes made during automatic labeling. Our analysis suggests that corrected co-training and similar moderately supervised methods may help co-training scale to large natural language learning tasks.
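The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the nearest-centroid classifier, the margin-based confidence score, and the names `train_centroids`, `predict`, `co_train`, and the `correct` callback are all assumptions chosen for brevity. Each example is a pair of numeric features, one per view, and the optional `correct` callback stands in for the human who fixes automatic labels in the corrected variant.

```python
from statistics import mean


def train_centroids(labeled, view):
    """Fit a toy nearest-centroid classifier on one view (hypothetical stand-in
    for whatever base classifier is actually used)."""
    labels = {y for _, y in labeled}
    return {
        label: mean(x[view] for x, y in labeled if y == label)
        for label in labels
    }


def predict(centroids, x, view):
    """Return (label, confidence) for x using a single view.

    Confidence is the distance margin between the nearest and the
    second-nearest centroid; assumes at least two classes.
    """
    ranked = sorted(centroids, key=lambda label: abs(x[view] - centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x[view] - centroids[runner_up]) - abs(x[view] - centroids[best])
    return best, margin


def co_train(labeled, unlabeled, rounds=5, per_round=1, correct=None):
    """Grow the labeled set by letting each view's classifier label the
    unlabeled examples it is most confident about.

    labeled:   list of ((view0, view1), label) seed examples
    unlabeled: list of (view0, view1) examples
    correct:   optional callable (x, predicted_label) -> label, simulating
               a human who reviews each automatic label before it is added
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                return labeled
            # Retrain on everything labeled so far, including examples
            # contributed by the classifier for the other view.
            model = train_centroids(labeled, view)
            ranked = sorted(
                pool, key=lambda x: predict(model, x, view)[1], reverse=True
            )
            for x in ranked[:per_round]:
                label = predict(model, x, view)[0]
                if correct is not None:
                    label = correct(x, label)
                labeled.append((x, label))
                pool.remove(x)
    return labeled


if __name__ == "__main__":
    seed = [((0.0, 0.0), 0), ((10.0, 10.0), 1)]
    pool = [(1.0, 0.5), (9.0, 9.5), (2.0, 1.0), (8.0, 8.5)]
    for example, label in co_train(seed, pool, rounds=2):
        print(example, label)
```

Without `correct`, any mislabeled example stays in the training set and influences later rounds, which is one way the bootstrapped data can degrade. Passing a `correct` callback models the moderately supervised variant, where every automatically assigned label is checked before it is added.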