Toward Unsupervised Whole-Corpus Tagging

Citation

Freitag, D. Toward Unsupervised Whole-Corpus Tagging. In Proceedings of Coling 2004, 2004.

Abstract

We present a system for unsupervised tagging of words into classes produced by a distributional clustering technique called co-clustering. A hidden Markov model (HMM), trained on the high-frequency terms in the lexicon, is used to tag occurrences of low-frequency terms. In experiments using the Wall Street Journal portion of the Penn Treebank, we show that previously reported problems in using Baum-Welch estimation for part-of-speech tagging do not occur in this context. We also show how state-level term emission models can be augmented to account for morphological patterns using features automatically derived from the output of co-clustering. Finally, we consider an alternative means of extending the coverage of the lexicon, in which low-frequency terms are added to the lexicon as types, and compare this approach with the token-level assignments made by the HMM.



