Citation
Freitag, D. Multistrategy learning for information extraction. In Proceedings of ICML 98, 1998.
Abstract
Information extraction (IE) is the problem of filling out predefined structured summaries from text documents. We are interested in performing IE in nontraditional domains, where much of the text is often ungrammatical, such as electronic bulletin board posts and Web pages. We suggest that the best approach is one that takes into account many different kinds of information, and argue for the suitability of a multistrategy approach. We describe learners for IE drawn from three separate machine learning paradigms: rote memorization, term-space text classification, and relational rule induction. By building regression models mapping from learner confidence to probability of correctness, and combining probabilities appropriately, it is possible to improve extraction accuracy over that achieved by any individual learner. We describe three different multistrategy approaches. Experiments on two IE domains, a collection of electronic seminar announcements from a university computer science department and a set of newswire articles describing corporate acquisitions from the Reuters collection, demonstrate the effectiveness of all three approaches.
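The idea of mapping learner confidence to probability of correctness and then combining the probabilities can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the learner names and validation numbers are hypothetical, the regression is a simple least-squares line clipped to [0, 1], and the combination rule is a noisy-or that treats the learners as independent. The paper's actual regression models and combination schemes may differ.

```python
from math import prod


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b (illustrative choice)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def to_probability(confidence, model):
    """Map a raw learner confidence to an estimated probability of correctness."""
    a, b = model
    return min(1.0, max(0.0, a * confidence + b))  # clip to a valid probability


def combine(probabilities):
    """Noisy-or combination (assumed rule): chance that at least one learner
    is correct, under the assumption that the learners are independent."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical held-out data per learner:
# (confidence scores, 1 if the prediction was correct else 0).
validation = {
    "rote": ([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1]),
    "term_space": ([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1]),
}

# One regression model per learner, since confidence scales are not comparable.
models = {name: fit_line(xs, ys) for name, (xs, ys) in validation.items()}

# Two learners propose the same text fragment with different raw confidences.
p_rote = to_probability(0.6, models["rote"])
p_term = to_probability(3.0, models["term_space"])
combined = combine([p_rote, p_term])
```

Fitting a separate model per learner is the point of the calibration step: raw confidences from different paradigms live on different scales, so they are only comparable after being mapped to a common probability scale.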