Language Modelling

Marcello Federico, FBK-irst, Trento, Italy

MT Marathon, Edinburgh, 2012
Outline
· Role of LM in ASR and MT
· N-gram language models
· Evaluation of language models
· Smoothing schemes
· Discounting methods
· Class-based LMs
· Maximum-Entropy LMs
· Neural Network LMs
· Toolkits and ARPA file format
Fundamental Equation of ASR
Goal: find the words w in a speech signal x such that:

w* = arg max_w Pr(x | w) Pr(w)    (1)

Problems:
· language modeling (LM): estimating Pr(w)
· acoustic modeling (AM): estimating Pr(x | w)
· search problem: computing (1)

The AM sums over hidden state sequences s, a Markov process generating (x, s) from w:

Pr(x | w) = Σ_s Pr(x, s | w)

Hidden Markov Model: hidden states "link" speech frames to words.
Fundamental Equation of SMT
Goal: find the English string e translating the foreign text f such that:

e* = arg max_e Pr(f | e) Pr(e)    (2)

Problems:
· language modeling (LM): estimating Pr(e)
· translation modeling (TM): estimating Pr(f | e)
· search problem: computing (2)

The TM sums over hidden alignments a, a stochastic process generating (f, a) from e:

Pr(f | e) = Σ_a Pr(f, a | e)

Alignment Models: hidden alignments "link" foreign words with English words.
ASR and MT Architectures
[Diagram: input → preprocessing → search (source → target) → postprocessing → output; the search is driven by an LM trained on target texts and by an AM/TM trained on parallel data.]

· Parallel data are samples of observations (x, w) and (f, e)
· AM and TM can be machine-learned without observing s and a.
Linear phrase-based SMT
· Translation hypotheses are ranked by:

e* = arg max_{e,a} Σ_i λ_i h_i(e, f, a)

· Phrases are finite strings (cf. n-grams)
· The hidden variable a embeds: the segmentation of f and e into phrases, and the alignment of the phrases of f with the phrases of e
· Feature functions h_i(·) include:
  Translation Model: appropriateness of phrase pairs
  Distortion Model: word reordering
  Language Model: fluency of the target string
  Length Model: number of target words
· The role of the LM is exactly the same as in the noisy-channel approach: to score the translations incrementally generated by the search algorithm!
N-gram Language Model
Given a text w = w_1 ... w_t ... w_|w|, we can compute its probability by:

Pr(w) = Pr(w_1) ∏_{t=2}^{|w|} Pr(w_t | h_t)    (3)

where h_t = w_1, ..., w_{t-1} denotes the history of word w_t.
· Pr(w_t | h_t) becomes difficult to estimate as the history h_t grows
· hence, we take the n-gram approximation h_t ≈ w_{t-n+1} ... w_{t-1}, e.g.:
  full history: Pr(Parliament | I declare resumed the session of the European)
  3-gram: Pr(Parliament | the European)

The choice of n determines the complexity of the LM (# of parameters):
· bad news: there is no magic recipe for the optimal order n for a given task
· good news: language models can be evaluated quite cheaply
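As a concrete illustration, here is a minimal Python sketch of chain-rule scoring under the n-gram approximation; the `logprob` lookup is a hypothetical stand-in for an actual LM, not part of the slides:

```python
def sentence_logprob(words, logprob, n=3):
    """Score a sentence via the chain rule (3) with n-gram histories.

    `logprob(history, word)` is assumed to return log2 Pr(word | history),
    where `history` is a tuple of at most n-1 preceding words.
    """
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[max(0, t - n + 1):t])  # n-gram approximation of h_t
        total += logprob(history, word)
    return total
```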
Language Model Evaluation
· Extrinsic: impact on the task (e.g. BLEU score for MT)
· Intrinsic: capability of predicting words

The perplexity (PP) measure is defined as:¹

PP = 2^LP   where   LP = − (1/|w|) log_2 p(w)    (4)

· w is a sufficiently long test sample and p(w) is the LM probability.

Properties:
· 0 ≤ PP ≤ |V| (the size of the vocabulary V)
· predictions are as good as guessing among PP equally likely options

Good news: there is typically a strong correlation between PP and BLEU scores!

¹[Exercise 1. Find the PP of a 1-gram LM on the sequence T H T H T H T T H T T H for p(T) = 0.3, p(H) = 0.7 and for p(H) = 0.3, p(T) = 0.7. Comment on the results.]
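Exercise 1 can be checked with a few lines of Python; the sketch below applies definition (4) directly to the two unigram models given in the exercise:

```python
import math

def perplexity(seq, p):
    """Perplexity of a unigram LM on a test sequence, eq. (4)."""
    lp = -sum(math.log2(p[w]) for w in seq) / len(seq)
    return 2 ** lp

seq = list("THTHTHTTHTTH")  # 7 T's and 5 H's
print(perplexity(seq, {"T": 0.3, "H": 0.7}))  # ~2.34: probabilities mismatch the sample
print(perplexity(seq, {"T": 0.7, "H": 0.3}))  # ~2.03: probabilities match the sample better
```

As expected, the model whose probabilities better match the sample's relative frequencies obtains the lower perplexity.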
Train-set vs. test-set perplexity
For an n-gram LM, the LP quantity can be computed as follows:

LP = − (1/|w|) Σ_{t=1}^{|w|} log_2 p(w_t | h_t)

PP is a function of an LM and a text:
· the lower the PP, the better the LM
· test-set PP evaluates the LM's generalization capability
· PP strongly penalizes zero probabilities
· train-set PP measures how well the LM explains the training data

Note: the train-set PP is strictly related to the train-set likelihood.
N-gram Probabilities
Estimating n-gram probabilities is not trivial due to:
· model complexity: e.g. 10,000 words correspond to 1 trillion 3-grams!
· data sparseness: e.g. most 3-grams are rare events even in huge corpora.

Relative frequency estimate: the MLE of any discrete conditional distribution is:

f(w | x y) = c(x y w) / Σ_{w'} c(x y w') = c(x y w) / c(x y)

where the n-gram counts c(·) are taken over the training corpus.

Problem: relative frequencies in general overfit the training data
· if the test sample contains a "new" n-gram, PP → +∞
· this is by far the most frequent case for n ≥ 3

We need frequency smoothing!
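A minimal sketch of the relative-frequency estimate for 3-grams (plain Python, counts taken from a token list):

```python
from collections import Counter

def trigram_relfreq(tokens):
    """MLE estimate f(w | x y) = c(x y w) / c(x y) from a training token list.

    Edge effects at the end of the corpus are ignored; this is a sketch.
    """
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))  # c(x y w)
    bi = Counter(zip(tokens, tokens[1:]))               # c(x y)
    return {(x, y, w): c / bi[(x, y)] for (x, y, w), c in tri.items()}
```

Any 3-gram absent from the training corpus gets probability zero under this estimate, which is exactly what makes smoothing necessary.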
Frequency Smoothing
Issue: f(w | x y) > 0 only if c(x y w) > 0

Idea: for each observed w, take off some fraction of probability from f(w | x y) and redistribute the total to all words never observed after x y.
· the discounted frequency f*(w | x y) satisfies:

0 ≤ f*(w | x y) ≤ f(w | x y)   for all x, y, w ∈ V

· the total discount is called the zero-frequency probability λ(x y):²

λ(x y) = 1.0 − Σ_{w ∈ V} f*(w | x y)

How do we redistribute the total discount?

²Notice: by convention λ(x y) = 1 if f*(w | x y) = 0 for all w, i.e. if c(x y) = 0.
Discounting Example

[Figure: a worked example of discounting n-gram relative frequencies and reserving probability mass for unseen words.]
Frequency Smoothing
Insight: redistribute λ(x y) according to the lower-order smoothed frequency.

Two major hierarchical schemes to compute the smoothed frequency p(w | x y):
· Back-off, i.e. select the best available n-gram approximation:

p(w | x y) = f*(w | x y)              if f*(w | x y) > 0
           = α_xy λ(x y) p(w | y)     otherwise              (5)

where α_xy is an appropriate normalization term.³
· Interpolation, i.e. sum up the two approximations:

p(w | x y) = f*(w | x y) + λ(x y) p(w | y)    (6)

Smoothed frequencies are learned bottom-up, starting from 1-grams ...
³[Exercise 2. Find an expression for α_xy such that Σ_w p(w | x y) = 1.]
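A minimal sketch of the interpolation scheme (6), recursing through shorter histories down to a uniform base distribution; the discounted frequencies `fstar` and discounts `lam` are assumed to be precomputed:

```python
def interp_prob(word, history, fstar, lam, vocab_size):
    """Interpolated smoothed probability p(word | history), eq. (6).

    `fstar[history]` maps words to discounted frequencies f*(. | history);
    `lam[history]` is the zero-frequency probability of `history`.
    """
    if not history:  # base case: interpolate the 1-gram with a uniform distribution
        return fstar.get((), {}).get(word, 0.0) + lam.get((), 1.0) / vocab_size
    f = fstar.get(history, {}).get(word, 0.0)
    l = lam.get(history, 1.0)  # by convention lambda = 1 for unseen histories
    return f + l * interp_prob(word, history[1:], fstar, lam, vocab_size)
```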
Frequency Smoothing of 1-grams
Unigram smoothing permits treating out-of-vocabulary (OOV) words in the LM.

Assumptions:
· |U|: upper-bound estimate of the size of the true vocabulary
· f*(w) > 0 on the observed vocabulary V, e.g. f*(w) = c(w)/(N + |V|)
· λ: total discount reserved to OOV words, here λ = |V|/(N + |V|)

Then the 1-gram back-off/interpolation schemes collapse to:

p(w) = f*(w)                     if w ∈ V
     = λ · (|U| − |V|)^(-1)      otherwise     (7)

Notice: we introduce approximations whenever an OOV word o appears:

p(w | h1 o h2) ≈ p(w | h2)   and   p(o | h) ≈ p(o)

Important: use a common value |U| when comparing/combining different LMs!
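A small sketch of scheme (7); the counts, N and the upper-bound vocabulary size U are assumed given, and the discount is taken as λ = |V|/(N + |V|), i.e. the mass left over by f*(w) = c(w)/(N + |V|):

```python
def unigram_prob(word, counts, N, U):
    """Smoothed 1-gram probability with OOV mass, eq. (7)."""
    V = len(counts)                      # observed vocabulary size |V|
    if word in counts:
        return counts[word] / (N + V)    # f*(w) = c(w) / (N + |V|)
    lam = V / (N + V)                    # total discount reserved to OOV words
    return lam / (U - V)                 # spread uniformly over the |U| - |V| unseen words
```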
Discounting Methods
Linear interpolation (LI) [Jelinek, 1990]
· Insight: learn λ(x y) from data, with a mixture-model MLE on some held-out data (→ EM algorithm)
· Solution: f*(w | x y) = (1 − λ([x y])) f(w | x y), with 0 ≤ λ([x y]) ≤ 1
  The notation [x y] means that a map is applied to reduce the set of parameters, e.g. according to the frequency of the last word of the history:

[x y] = 0       if c(y) ≤ k1
      = c(y)    if k1 < c(y) ≤ k2
      = k2      if k2 < c(y)

· Pros: sound and robust
· Cons: over-smooths frequent n-grams
Discounting Methods
Witten-Bell estimate (WB) [Witten and Bell, 1991]
· Insight: count how often you would back off after x y in the training data

corpus: x y u x x y t t x y u w x y w x y t u x y u x y t

  assume λ(x y) ∝ the number of back-offs, i.e. the number of distinct words observed after x y (here 3)
  hence f*(w | x y) ∝ the relative frequency (linear discounting)
· Solution:

λ(x y) = n(x y •) / (c(x y) + n(x y •))   and   f*(w | x y) = c(x y w) / (c(x y) + n(x y •))

where c(x y) = Σ_w c(x y w) and n(x y •) = |{w : c(x y w) > 0}|.⁴

· Pros: easy to compute, robust for small corpora, works with artificial data.
· Cons: underestimates the probability of frequent n-grams

⁴[Exercise 3. Compute f*(u | x y) with WB on the above artificial text.]
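Exercise 3 can be verified directly; this sketch computes the WB quantities for the history "x y" on the artificial corpus of this slide:

```python
from collections import Counter

tokens = "x y u x x y t t x y u w x y w x y t u x y u x y t".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))  # c(x y w)

h = ("x", "y")
c_xy = sum(c for ngram, c in tri.items() if ngram[:2] == h)  # c(x y) = 7
n_xy = len({ngram[2] for ngram in tri if ngram[:2] == h})    # n(x y .) = 3

lam = n_xy / (c_xy + n_xy)                   # lambda(x y) = 3/10
f_u = tri[("x", "y", "u")] / (c_xy + n_xy)   # f*(u | x y) = 3/10
print(lam, f_u)
```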
Discounting Methods
· Interpolation and back-off with WB discounting
· Trigram LMs estimated on the English Europarl corpus
· Log-probs of 3-grams of type "aiming at _" observed in training

[Plot: log-prob (−6 to −1.5) of the top 100 most frequent 3-grams "aiming at _" in Europarl, comparing WB interpolation against WB back-off.]

· Peaks correspond to very probable 2-grams interpolated with f*: "at that", "at national", "at European"
· Back-off performs slightly better than interpolation but costs more
Discounting Methods
Absolute Discounting (AD) [Ney and Essen, 1991]
· Insight: high counts are more reliable than low counts
  subtract a small constant β (0 < β ≤ 1) from each count
  estimate β by maximizing the leaving-one-out likelihood of the training data
· Solution (notice: one distinct β for each n-gram order):

f*(w | x y) = max(c(x y w) − β, 0) / c(x y)

which gives

λ(x y) = β |{w : c(x y w) > 0}| / c(x y)

where β ≈ n1 / (n1 + 2 n2) ≤ 1 and n_r = |{x y w : c(x y w) = r}|.⁵

· Pros: easy to compute, accurate estimates of frequent n-grams.
· Cons: problematic with small and artificial samples.

⁵[Exercise 4. Given the text of the WB slide, find the number of 3-grams, n1, n2, β, f*(w | x y) and λ(x y).]
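Exercise 4 can be checked the same way; this sketch computes the AD quantities on the artificial corpus from the WB slide:

```python
from collections import Counter

tokens = "x y u x x y t t x y u w x y w x y t u x y u x y t".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))   # 23 3-gram tokens, 16 types

n1 = sum(1 for c in tri.values() if c == 1)          # 3-gram types seen once (11)
n2 = sum(1 for c in tri.values() if c == 2)          # 3-gram types seen twice (3)
beta = n1 / (n1 + 2 * n2)                            # 11/17, approx. 0.65

h = ("x", "y")
c_xy = sum(c for ngram, c in tri.items() if ngram[:2] == h)
n_xy = sum(1 for ngram in tri if ngram[:2] == h)
f_u = max(tri[("x", "y", "u")] - beta, 0) / c_xy     # f*(u | x y)
lam = beta * n_xy / c_xy                             # lambda(x y)
print(n1, n2, beta, f_u, lam)
```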
Discounting Methods
Kneser-Ney method (KN) [Kneser and Ney, 1995]
· Insight: lower-order counts are only used in case of back-off
  estimate the frequency of backing off to y w in the training data (cf. WB)

corpus: x y w x t y w t x y w u y w t y w u x y w u u y w

  replace c(y w) with n(• y w) = # of observed back-off contexts (= 3)
· Solution (for a 3-gram LM; the top level keeps normal counts):

f*(w | y) = max(n(• y w) − β, 0) / n(• y •)

which gives

λ(y) = β |{w : n(• y w) > 0}| / n(• y •)

where n(• y w) = |{x : c(x y w) > 0}| and n(• y •) = |{(x, w) : c(x y w) > 0}|

· Pros: better back-off probabilities; can be applied to other smoothing methods
· Cons: the LM cannot be used to compute lower-order n-gram probabilities
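The corrected counts are cheap to compute; this sketch derives n(• y w) on the artificial corpus of this slide:

```python
from collections import Counter

tokens = "x y w x t y w t x y w u y w t y w u x y w u u y w".split()
tri = set(zip(tokens, tokens[1:], tokens[2:]))   # distinct observed 3-grams

# n(. y w): number of distinct left contexts x observed before "y w"
n_cont = Counter((y, w) for (x, y, w) in tri)
print(n_cont[("y", "w")])                        # 3, as in the slide
```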
Discounting Methods
Modified Kneser-Ney (MKN) [Chen and Goodman, 1999]
· Insight: specific discounting coefficients for infrequent n-grams; introduce more parameters and estimate them by leaving-one-out
· Solution:

f*(w | x y) = (c(x y w) − β(c(x y w))) / c(x y)

where β(0) = 0, β(1) = D1, β(2) = D2, β(c) = D3+ for c ≥ 3; the coefficients are computed from the n_r statistics, and corrected counts are used for the lower-order n-grams
· Pros: see previous + more fine-grained smoothing
· Cons: see previous + more sensitivity to noise

Important: LM interpolation with MKN is the most popular smoothing method. Under proper training conditions it gives the best PP and BLEU scores!
Discounting Methods
· Interpolation with WB and MKN discounting (Europarl corpus)
· The plot shows the log-prob of observed 3-grams of type "aiming at _"

[Plot: log-prob (−7 to −1) of the top 100 most frequent 3-grams "aiming at _" in Europarl, Witten-Bell vs. Modified Kneser-Ney.]

· Notice that for the less frequent 3-grams, WB assigns higher probability
· Three very high peaks correspond to large corrected counts: n(• at that) = 665, n(• at national) = 598, n(• at European) = 1118
· Another interesting peak at rank #26: n(• at very) = 61
Discounting Methods
· Train: interpolation with WB and MKN discounting (Europarl corpus)
· Test: 3-grams of type "aiming at _" (Google 1T Web corpus)

[Plot: log-prob (−22 to 0) of the top 100 most frequent 3-grams "aiming at _" in the Web1T 5-gram corpus, Witten-Bell vs. Modified Kneser-Ney.]
Discounting Methods
· Train: interpolation with WB and MKN discounting (Europarl corpus)
· Test: 3-grams of type "aiming at _" (Google 1T Web corpus)
· Plot: cumulative score differences between MKN and WB on the top 1000 3-grams

[Plot: cumulative score difference (N × log-prob, MKN − WB) over the top 1000 3-grams "aiming at _" in the Web1T 5-gram corpus, rising to roughly 1000.]
Approximate Smoothing
· LM Quantization [Federico and Bertoldi, 2006]
  Idea: one codebook for each n-gram/back-off level
  Pros: improves storage efficiency
  Cons: reduces discriminatory power
  Experiments with 8-bit quantization on the ZH-EN NIST task showed:
  * a 2.7% BLEU drop with a 5-gram LM trained on 100M words
  * a 1.6% BLEU drop with a 5-gram LM trained on 1.7G words
· Stupid back-off [Brants et al., 2007]
  no discounting, no corrected counts, no back-off normalization:

p(w | x y) = f(w | x y)     if f(w | x y) > 0
           = k · p(w | y)   otherwise           (8)

where k = 0.4 and p(w) = c(w)/N.
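A minimal sketch of stupid back-off (8) for a trigram model; `counts` is assumed to map n-gram tuples of every order to their corpus counts:

```python
def stupid_backoff(word, history, counts, N, k=0.4):
    """Stupid back-off score, eq. (8). Scores are not normalized probabilities."""
    score = 1.0
    while history:
        c_ngram = counts.get(history + (word,), 0)
        if c_ngram > 0:
            return score * c_ngram / counts[history]  # relative frequency f(w | history)
        history, score = history[1:], score * k       # back off, paying factor k
    return score * counts.get((word,), 0) / N         # base case: p(w) = c(w)/N
```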
Is LM Smoothing Necessary?
From [Brants et al., 2007]. SB = stupid back-off, KN = modified Kneser-Ney

[Figure: BLEU as a function of LM training-data size, comparing SB and KN.]

· Conclusion: proper smoothing is useful up to about 1 billion words of training data!
Factored LMs
· Use a less sparse representation of words than their surface form
  e.g. part-of-speech tags, semantic classes, lemmas, automatic clusters
· Higher chance to match longer n-grams in test sequences
  allows modelling longer dependencies and capturing more syntactic structure
· For a text w we assume a corresponding class sequence g
  either ambiguous (e.g. POS) or deterministic (word classes)
· Factored LMs can be integrated into log-linear models with:
  a word-to-class factored model f → e → g with features:
    h1(f, e), h2(f, g), h3(f), h4(g)
  a word-class joint model f → (e, g) with features:
    h1(f, e, g), h2(f), h3(g)
  The features of the single sequences are log-probs of standard n-gram LMs.
Maximum Entropy N-gram LMs
· The n-gram probability is modeled with a log-linear model [Rosenfeld, 1996]:

p(w | h) = exp(Σ_{r=1}^m λ_r h_r(h, w)) / Σ_{w'} exp(Σ_{r=1}^m λ_r h_r(h, w')) = (1 / Z(h)) exp(Σ_{r=1}^m λ_r h_r(h, w))

· the h_r(·) are feature functions (arbitrary statistics), the λ_r are free parameters
· Features can model any dependency between w and h.
· Given the feature functions and a training sample w, the parameters can be estimated [Berger et al., 1996] by maximizing the posterior log-likelihood:

λ̂ = arg max_{λ ∈ R^m} Σ_{t=1}^{|w|} log p(w_t | h_t) + log q(λ)

· where the second term is a regularizing Gaussian prior
· ME n-grams are rarely used: they perform comparably to standard n-grams but at higher computational cost, because of the partition function Z(h).
Neural Network LMs
· Most promising among recent developments of n-gram LMs.
· Idea:
  map each word into a |V|-dimensional vector space
  represent the n-gram LM as a map between vector spaces
· Solution: learn the map with a neural network (NN) architecture
  one hidden layer compresses the information (projection)
  a second hidden layer performs the n-gram prediction
  other architectures are possible, e.g. recurrent NNs
· Implementations:
  Continuous Space Language Model [Schwenk et al., 2006]
  Recurrent Neural Network Language Modeling Toolkit⁶
· Pros: fast runtime, competitive when used jointly with a standard n-gram LM
· Cons: computational cost of the training phase; not easy to integrate into the search algorithm (used in rescoring)

⁶http://rnnlm.sourceforge.net
Neural Network LMs

[Figure: continuous-space LM architecture, from [Schwenk et al., 2006].]
Language Modelling Today
· The availability of large-scale corpora has pushed research toward huge LMs
· MT systems set up for evaluations use LMs with over a billion 5-grams
· Estimating accurate large-scale LMs is still computationally costly
· Querying large LMs can be carried out rather efficiently (with adequate RAM)

Available LM toolkits:
· SRILM: training and runtime, Moses support, open source (non-commercial)
· IRSTLM: training and runtime, Moses support, open source
· KENLM: runtime, Moses support, open source

Interoperability:
· The standard for n-gram LM representation is the so-called ARPA file format.
ARPA File Format (srilm, irstlm)
Represents both interpolated and back-off n-gram LMs
· format: log(smoothed prob) :: n-gram :: log(back-off weight)
· computation: look up the smoothed prob first, otherwise back off

\data\
ngram 1= 86700
ngram 2= 1948935
ngram 3= 2070512

\1-grams:
-2.94351  world    -0.51431
-6.09691  friends  -0.15553
-2.88382  !        -2.38764
...
\2-grams:
-3.91009  world !        -0.3514
-3.91257  hello world    -0.2412
-3.87582  hello friends  -0.0312
...
\3-grams:
-0.00108  hello world !
-0.00027  hi hello !
...
\end\
ARPA File Format (srilm, irstlm)
The same LM file as on the previous slide.

Query: Pr(! | hello friends)?
1. lookup logPr(hello friends !) → failed! then back off
2. lookup logBow(hello friends) → res = −0.0312
3. lookup logPr(friends !) → failed! then back off
4. lookup logBow(friends) → res = res − 0.15553
5. lookup logPr(!) → res = res − 2.88382
6. prob = exp(res) = 0.04640
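The query procedure is easy to mimic in code; this sketch replays the example above against toy tables holding the slide's (natural-log) values, assuming every queried word is at least in the 1-gram table:

```python
import math

logprob = {("world",): -2.94351, ("friends",): -6.09691, ("!",): -2.88382,
           ("world", "!"): -3.91009, ("hello", "world"): -3.91257,
           ("hello", "friends"): -3.87582,
           ("hello", "world", "!"): -0.00108, ("hi", "hello", "!"): -0.00027}
logbow = {("world",): -0.51431, ("friends",): -0.15553, ("!",): -2.38764,
          ("world", "!"): -0.3514, ("hello", "world"): -0.2412,
          ("hello", "friends"): -0.0312}

def query(ngram):
    """Back-off query: use the longest matching n-gram, paying back-off weights."""
    res = 0.0
    while ngram not in logprob:             # n-gram missing: back off
        res += logbow.get(ngram[:-1], 0.0)  # add log back-off weight of the history
        ngram = ngram[1:]                   # shorten the history
    return res + logprob[ngram]

print(math.exp(query(("hello", "friends", "!"))))  # ~0.04640, as in the slide
```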
References
[Berger et al., 1996] Berger, A., Pietra, S. A. D., and Pietra, V. J. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

[Brants et al., 2007] Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858-867, Prague, Czech Republic.

[Chen and Goodman, 1999] Chen, S. F. and Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359-394.

[Federico, 1999] Federico, M. (1999). Usability evaluation of a spoken data-entry interface. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, volume I, pages 726-731, Florence, Italy.

[Federico and Bertoldi, 2001] Federico, M. and Bertoldi, N. (2001). Broadcast news LM adaptation using contemporary texts. In Proceedings of the 7th European Conference on Speech Communication and Technology, Aalborg, Denmark.

[Federico and Bertoldi, 2006] Federico, M. and Bertoldi, N. (2006). How many bits are needed to store probabilities for phrase-based translation? In Proceedings of the Workshop on Statistical Machine Translation, pages 94-101, New York City. Association for Computational Linguistics.

[Federico et al., 2008] Federico, M., Bertoldi, N., and Cettolo, M. (2008). IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia.

[Franz and McCarley, 2002] Franz, M. and McCarley, J. S. (2002). How many bits are needed to store term frequencies? In Proceedings of ACM SIGIR, pages 377-378, Tampere, Finland.

[Good, 1953] Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264.

[Jelinek, 1990] Jelinek, F. (1990). Self-organized language modeling for speech recognition. In Waibel, A. and Lee, K., editors, Readings in Speech Recognition, pages 450-505. Morgan Kaufmann, Los Altos, CA.

[Katz, 1987] Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust., Speech and Signal Proc., ASSP-35(3):400-401.

[Kneser and Ney, 1995] Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181-184, Detroit, MI.

[Nadas, 1985] Nadas, A. (1985). On Turing's formula for word probabilities. IEEE Trans. Acoust., Speech and Signal Proc., ASSP-33:1414-1416.

[Ney and Essen, 1991] Ney, H. and Essen, U. (1991). On smoothing techniques for bigram-based natural language modelling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 825-828, Toronto, Canada.

[Rosenfeld, 1996] Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187-228.

[Schwenk et al., 2006] Schwenk, H., Dechelotte, D., and Gauvain, J.-L. (2006). Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 723-730, Sydney, Australia. Association for Computational Linguistics.

[Whittaker and Raj, 2001] Whittaker, E. and Raj, B. (2001). Quantization-based language model compression. In Proceedings of Eurospeech, pages 33-36, Aalborg, Denmark.

[Witten and Bell, 1991] Witten, I. H. and Bell, T. C. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory, IT-37(4):1085-1094.