Citation
Freitag D., McCallum A. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
Abstract
“Information extraction” refers to the process of converting documents into structured content summaries. Such summaries can be presented to users or consumed by software agents engaged in text mining. This paper advocates the use of hidden Markov models (HMMs) for information extraction. The HMM state transition probabilities and word emission probabilities are learned from labeled training data. As in many learning problems, however, a shortage of labeled training data limits the reliability of the model. The key contribution of this paper is to exploit relationships between HMM states, together with a statistical technique called “shrinkage”, to significantly improve the estimates of the HMM emission probabilities when training data is sparse. In experiments on seminar announcements and Reuters acquisitions articles, shrinkage is shown to reduce error by up to 40%, and the resulting HMM outperforms a state-of-the-art rule-learning system.
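To make the idea of shrinkage concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes the simplest possible relationship between states (every state shares one pooled "parent" estimate, plus a uniform distribution) and uses fixed, hand-picked mixture weights; in practice the weights would be estimated from data rather than set by hand. The function names, the example states ("speaker", "background"), and the word counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mle(counts, vocab):
    """Maximum-likelihood word distribution; falls back to uniform if no counts."""
    total = sum(counts.values())
    if total == 0:
        return {w: 1.0 / len(vocab) for w in vocab}
    return {w: counts[w] / total for w in vocab}


def shrink_emissions(state_counts, weights=(0.6, 0.3, 0.1)):
    """Interpolate each state's emission estimate with coarser estimates.

    state_counts: dict mapping state name -> Counter of word counts.
    weights: (state-specific, pooled, uniform) mixture weights summing to 1.
             Fixed weights are an assumption made for this sketch.
    """
    vocab = sorted({w for c in state_counts.values() for w in c})
    # Pooled "parent" estimate: word counts summed over all states.
    pooled = Counter()
    for c in state_counts.values():
        pooled.update(c)
    pooled_est = mle(pooled, vocab)
    uniform = 1.0 / len(vocab)

    w_state, w_pooled, w_uniform = weights
    shrunk = {}
    for state, counts in state_counts.items():
        local = mle(counts, vocab)
        shrunk[state] = {
            w: w_state * local[w] + w_pooled * pooled_est[w] + w_uniform * uniform
            for w in vocab
        }
    return shrunk


# Hypothetical toy counts: the "speaker" state has seen very few words.
example = {
    "speaker": Counter({"dr": 2, "smith": 1}),
    "background": Counter({"the": 5, "seminar": 3, "dr": 1}),
}
shrunk = shrink_emissions(example)
print(shrunk["speaker"]["the"])  # nonzero although "the" was never seen in "speaker"
```

The point of the sketch is that a sparsely trained state no longer assigns zero probability to words it has not seen: its estimate is pulled ("shrunk") toward estimates that are better supported by data, while each state's distribution still sums to one.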