Part of Speech Tagging in Context
Michele Banko, Robert Moore
Presented by Alex Cheng ([email protected])
Ling 575, Winter 08

Overview
• Comparison of previous methods
• Using context from both sides
• Lexicon construction
• Sequential EM for tag-sequence and lexical probabilities
• Discussion questions

Previous methods
• Trigram model: P(t_i | t_{i-1}, t_{i-2})
• Kupiec (1992): divides the lexicon into word classes
  – Words contained within the same equivalence class possess the same set of possible POS tags
• Brill (1995): UTBL
  – Uses information from the distribution of unambiguously tagged data to make labeling decisions
  – Considers both left and right context
• Toutanova (2003): conditional Markov model
  – Supervised learning method
  – Increases accuracy from 96.10% to 96.55%
• Lafferty (2001)
  – Compared HMMs with MEMMs and CRFs

Contextualized HMM
• Estimates the probability of a word w_i based on t_{i-1}, t_i, and t_{i+1} (a sketch of this emission model appears after the Discussion slide)
• Leads to higher dimensionality in the parameters
• Smoothed with a standard absolute-discounting scheme

Lexicon construction
• Lexicons are provided for both testing and training
• Initialize with a uniform distribution over all possible tags for each word
• Experiments with using word classes, as in the Kupiec model

Problems
• Limiting the possible tags per lexicon entry
  – Tags that appeared less than X% of the time for a given word are omitted (sketched below)

HMM Model Training
• Extract unambiguous tag sequences
  – Use these n-grams and their counts to bias the initial estimate of the HMM's state transitions (sketched below)
• Sequential training (sketched below)
  – First train the transition probabilities, keeping the lexical probabilities constant
  – Then train the lexical probabilities, keeping the transition probabilities constant

Discussion
• Sequential training of the HMM estimates the two parameter sets separately. Is there any theoretical significance? What is the computational cost?
• What are the effects of modeling the tag context differently, e.g. using P(t_i | t_{i-1}, t_{i+1})?
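To make the contextualized emission model concrete, below is a minimal Python sketch of estimating P(w_i | t_{i-1}, t_i, t_{i+1}) with absolute discounting, backing off to the context-free P(w_i | t_i). The counts here come from tagged sentences for clarity (in the unsupervised setting they would be expected counts from EM), and the discount d = 0.75, the backoff scheme, and the function name are assumptions; the slide only says a standard absolute-discounting scheme is used.

```python
from collections import defaultdict

def contextual_emissions(tagged_sents, d=0.75):
    """Estimate P(w_i | t_{i-1}, t_i, t_{i+1}) with absolute discounting,
    backing off to P(w_i | t_i).  tagged_sents: list of [(word, tag), ...]."""
    ctx_counts = defaultdict(lambda: defaultdict(float))  # (t-1, t, t+1) -> word -> count
    tag_counts = defaultdict(lambda: defaultdict(float))  # t -> word -> count
    for sent in tagged_sents:
        padded = [("", "<s>")] + sent + [("", "</s>")]
        for i in range(1, len(padded) - 1):
            w, t = padded[i]
            ctx = (padded[i - 1][1], t, padded[i + 1][1])
            ctx_counts[ctx][w] += 1
            tag_counts[t][w] += 1

    def prob(w, ctx):
        _, t, _ = ctx
        backoff = tag_counts[t]
        backoff_total = sum(backoff.values()) or 1.0
        p_backoff = backoff.get(w, 0.0) / backoff_total
        words = ctx_counts[ctx]
        total = sum(words.values())
        if total == 0:                      # unseen context: pure backoff
            return p_backoff
        discounted = max(words.get(w, 0.0) - d, 0.0) / total
        alpha = d * len(words) / total      # mass freed by discounting
        return discounted + alpha * p_backoff

    return prob
```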
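The tag-pruning step from the Problems slide can be sketched as follows. The slide leaves the cutoff X unspecified, so the 10% default below is purely an assumption, as is the helper name.

```python
def prune_lexicon(word_tag_counts, threshold=0.10):
    """Drop tags accounting for less than `threshold` of a word's occurrences.
    word_tag_counts: {word: {tag: count}} gathered from tagged data."""
    lexicon = {}
    for word, counts in word_tag_counts.items():
        total = sum(counts.values())
        kept = {t for t, c in counts.items() if c / total >= threshold}
        lexicon[word] = kept or set(counts)  # never leave a word with no tags
    return lexicon
```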
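Biasing the initial transition estimates with unambiguous tag sequences could look like the sketch below, restricted to bigrams for brevity; the additive smoothing constant and the function name are assumptions.

```python
from collections import Counter

def biased_transition_init(sentences, lexicon, smoothing=0.1):
    """Initialize P(t_i | t_{i-1}) from tag bigrams observed at positions
    where both words are unambiguous (exactly one tag in the lexicon)."""
    tags = sorted({t for ts in lexicon.values() for t in ts})
    counts = Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            ts1, ts2 = lexicon.get(w1, set()), lexicon.get(w2, set())
            if len(ts1) == 1 and len(ts2) == 1:
                counts[(next(iter(ts1)), next(iter(ts2)))] += 1
    init = {}
    for t1 in tags:
        total = sum(counts[(t1, t2)] for t2 in tags) + smoothing * len(tags)
        init[t1] = {t2: (counts[(t1, t2)] + smoothing) / total for t2 in tags}
    return init
```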
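Finally, the sequential training procedure alternates EM over the two parameter blocks. In this schematic, `hmm.em_step` is a hypothetical method standing in for one forward-backward pass that returns re-estimated transition and emission tables; only one block is accepted in each phase.

```python
def sequential_em(hmm, corpus, rounds=5, iters=10):
    """Alternate: re-estimate transitions with emissions frozen, then
    emissions with transitions frozen, per the training slide."""
    for _ in range(rounds):
        for _ in range(iters):
            trans, _ = hmm.em_step(corpus)   # E-step + M-step over both blocks
            hmm.transitions = trans          # ...but keep only the transitions
        for _ in range(iters):
            _, emis = hmm.em_step(corpus)
            hmm.emissions = emis             # now keep only the emissions
    return hmm
```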
Improved Estimation for Unsupervised POS Tagging
Qin Iris Wang, Dale Schuurmans
Presented by Alex Cheng ([email protected])
Ling 575, Winter 08

Overview
• Focus on parameter estimation
  – Considering only simple models with limited context (a standard bigram HMM)
• Constraint on marginal tag probabilities
• Smoothing lexical parameters using word similarities
• Discussion questions

Parameter Estimation
• Banko and Moore (2004) reduce the error rate from 22.8% to 4.1% by reducing the set of possible tags for each word.
  – This requires tagged data to find the artificially reduced lexicon.
• EM is guaranteed to converge to a local maximum.
• HMMs tend to have multiple local maxima.
  – As a result, the quality of the estimated parameters may have more to do with the initial parameter estimates than with the EM procedure itself.

Estimation problems
• Using the standard model:
  – Tag -> tag: uniform over all tags
  – Tag -> word: uniform over all possible tags for the word (as specified in the complete lexicon)
• The estimated transition probabilities are quite poor.
  – For example, 'a' is always tagged LS (list item marker).
• The estimated lexical probabilities are also quite poor.
  – Each parameter b_{t,w1}, b_{t,w2} is treated as independent.
  – EM tends to over-fit the lexical model and ignore similarity between words.

Marginally Constrained HMMs (Tag -> Tag probabilities)
• Maintain a specific marginal distribution over the tag probabilities (see the sketch after the Discussion slide).
  – Assumes we are given a target distribution over tags (raw tag frequencies)
    • Can be obtained from tagged data
    • Can be approximated (see Toutanova, 2003)

Similarity-based Smoothing (Tag -> Word probabilities)
• Use a feature vector f for each word w consisting of the contexts (left and right words) of w (see the sketch after the Discussion slide).
• The 100,000 most frequent words were taken as features.

Results

Discussion
• Compared to Banko and Moore, are the methods used here "more or less" unsupervised?
  – Banko and Moore use lexicon ablation.
  – Here, only the raw frequencies of tags are used.
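One concrete way to impose the marginal constraint on the tag -> tag probabilities is to rescale the expected tag-bigram matrix until its marginals match the target tag distribution, e.g. with iterative proportional fitting (Sinkhorn scaling). This is an illustrative sketch under that assumption, not necessarily the paper's exact update rule; the function name is hypothetical.

```python
import numpy as np

def constrain_marginals(joint, target, iters=50):
    """Rescale an expected tag-bigram matrix `joint` (shape T x T) so both
    marginals match `target` (a length-T tag distribution), then return
    the conditional transition matrix P(t_i | t_{i-1})."""
    P = (joint + 1e-12) / (joint + 1e-12).sum()   # epsilon avoids zero rows
    for _ in range(iters):
        P *= (target / P.sum(axis=1))[:, None]    # match row marginals
        P *= (target / P.sum(axis=0))[None, :]    # match column marginals
    return P / P.sum(axis=1, keepdims=True)
```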
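The similarity-based smoothing of the lexical parameters can be sketched in two steps: build left/right-context vectors, then mix each b_{t,w} with its most similar neighbors. The brute-force cosine search, the k = 5 cutoff, the 50/50 mixing weight, and both function names are assumptions; the slides say only that context vectors over the 100,000 most frequent words are used as features.

```python
import numpy as np
from collections import defaultdict

def context_vectors(sentences, features):
    """One count vector per word: left-context counts, then right-context
    counts, restricted to the chosen feature words."""
    idx = {f: i for i, f in enumerate(features)}
    vecs = defaultdict(lambda: np.zeros(2 * len(idx)))
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            w, left, right = padded[i], padded[i - 1], padded[i + 1]
            if left in idx:
                vecs[w][idx[left]] += 1
            if right in idx:
                vecs[w][len(idx) + idx[right]] += 1
    return vecs

def smooth_emissions(b, vecs, vocab, k=5):
    """Mix each lexical parameter b[t][w] with the average over the k words
    most cosine-similar to w (per-tag renormalization omitted for brevity)."""
    def cos(u, v):
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / n) if n else 0.0
    smoothed = {}
    for t, row in b.items():
        smoothed[t] = {}
        for w in vocab:
            sims = sorted(((cos(vecs[w], vecs[u]), u) for u in vocab if u != w),
                          reverse=True)[:k]
            neighbor = np.mean([row.get(u, 0.0) for _, u in sims]) if sims else 0.0
            smoothed[t][w] = 0.5 * row.get(w, 0.0) + 0.5 * neighbor
    return smoothed
```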