Combining statistics on n-grams for automatic term recognition
Jordi Porta Zamorano, Rafael J. Ruiz Ureña, Fernando Sánchez León
Departamento de Lingüística Computacional
Real Academia Española, c/ Felipe IV, 4. 28071 Madrid, Spain
{almu, amunicio, fernando, porta, rafa, fsanchez}@rae.es
Real Academia de Ciencias
c/ Valverde, 22-24. 28004 Madrid Spain
Abstract

This paper presents work in progress on the development of an automatic term recognition (ATR) system built around the Corpus Científico-Técnico (CCT). Terms are modeled using three non-correlated dimensions: unithood, domainhood and usage, applied to a set of n-grams automatically extracted from the corpus. These dimensions are combined with a supervised Machine Learning algorithm in order to classify n-grams as terms or non-terms. Results for both noise and silence are promising given the paucity of data employed for training. Moreover, error analysis on noise reveals that other information dimensions can be used to reduce noise significantly.
1. Introduction

This paper presents work in progress on the development of an automatic term recognition (ATR) system built around the Corpus Científico-Técnico (CCT). The CCT gathers Spanish texts from scientific/technical domains, organized according to a taxonomy of scientific disciplines and encoded in an XCES-compliant format. The corpus is expected to contain 30 million words by the end of 2004, and it will be a source of information complementary to the glossaries and terminological dictionaries edited by the Spanish Royal Academy of Sciences, the funding institution. To date, the CCT contains 122 texts from chemistry, physics, biology, medicine, mathematics and telecommunications, amounting to some 1.2 million words. The texts belong to the Royal Academy of Sciences or have been obtained through agreements with publishers of technical books and journals.

The whole text acquisition and encoding process has been automatised (with some human intervention), and texts must already exist in electronic form so that acquisition is faster and less error-prone (avoiding transcription and OCR mistakes). Besides this automatisation process on the source texts, some of the Academy's technical dictionaries (specifically, the Diccionario Esencial de las Ciencias, DEC) have also been processed in order to both encode the information in XML and extract the term list. However, lexicographers need new term candidates rather than already known terms, and knowledge-based term recognition has an inherent semantic dimension. For both reasons an empirical approach has been adopted, and the extracted term list is used only sparingly in this paper.

2. Modeling terms

Most definitions of term rest mainly on semantics and become non-operational for computing without lexical and domain knowledge. This led us to an empirical approach in which n-grams are extracted from the corpus and characterized along three different dimensions presumed relevant for term identification. The measurements performed in each dimension have been borrowed from the fields of lexicography and Information Extraction (IE).
2.1. Unithood

Unithood is the degree of lexical cohesive force shown by the elements of an n-gram. Both complex terms and collocations from a scientific/technical corpus have similar cohesion values, so a considerable overlap may occur. Unithood is approximated by measuring the degree of association among the words contained in the n-gram¹. Mutual information has been reported as a useful statistic for extracting collocations despite its stability problems with low-frequency data (Church and Hanks, 1991).

Generalized Mutual Information (GMI), which deals with arbitrary n, has been formulated in different ways (see for instance Chien and Chun-Liang (2001)). For n > 2 we have adopted that of Yamamoto and Church (2001). The formulation of GMI for a given n-gram used in this paper is:

$$
\mathrm{GMI}(w_1 \ldots w_n) =
\begin{cases}
\log \dfrac{N \cdot tf(w_1 w_2)}{tf(w_1)\, tf(w_2)} & n = 2 \\[2ex]
\log \dfrac{tf(w_1 \ldots w_n)\, tf(w_2 \ldots w_{n-1})}{tf(w_1 \ldots w_{n-1})\, tf(w_2 \ldots w_n)} & n > 2
\end{cases}
$$

where $tf(w_i \ldots w_j)$ is the frequency of an n-gram and $N$ the corpus size.
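As an illustration of the unithood measure, here is a minimal Python sketch. It assumes the Yamamoto and Church (2001) form of mutual information, log(tf(xYz) · tf(Y) / (tf(xY) · tf(Yz))), with tf(Y) replaced by the corpus size N when Y is empty (the bigram case). The helper names `ngram_counts` and `gmi`, the base-2 logarithm and the use of the token count as N are assumptions made for the example, not details given in the paper.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gmi(ngram, counts, corpus_size):
    """log2( tf(xYz) * tf(Y) / (tf(xY) * tf(Yz)) ) for an n-gram xYz, n >= 2.

    Assumption: base-2 logarithm; corpus_size plays the role of N.
    """
    if len(ngram) < 2:
        raise ValueError("GMI is not applicable to unigrams")
    inner = ngram[1:-1]
    # For bigrams the inner substring Y is empty, so its count is N.
    tf_inner = counts[inner] if inner else corpus_size
    return math.log2(
        counts[ngram] * tf_inner / (counts[ngram[:-1]] * counts[ngram[1:]])
    )


tokens = "a b a b c".split()
counts = ngram_counts(tokens)
print(gmi(("a", "b"), counts, len(tokens)))       # bigram case, uses N
print(gmi(("a", "b", "c"), counts, len(tokens)))  # trigram case
```

Counting n-grams with a `Counter` is only practical for small samples; it stands in here for the suffix-array computation the paper relies on.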
2.2. Domainhood (distribution within the corpus)
Terms are characteristic of a domain. The distribution of most, if not all, terms in the CCT should be far from uniform due to its balanced design. A useful measure from IE that identifies good keywords from documents for retrieval is the Inverse Document Frequency (IDF) (Spärck Jones, 1973). A measure based on IDF and tf, the Residual Inverse Document Frequency (RIDF), is proposed by Church and Gale (1999) as a better measure to extract keywords. It selects those whose distribution differs from what would be expected under a Poisson model. The formulation of RIDF is:

$$
\mathrm{RIDF}(w) = \mathrm{IDF}(w) - \widehat{\mathrm{IDF}}(w) = -\log_2 \frac{df(w)}{D} + \log_2 \left( 1 - e^{-tf(w)/D} \right)
$$

where $df$ is the IE document frequency (calculated as the number of texts where an n-gram occurs) and $D$ is the number of texts in the corpus.

¹ When n = 1, and the n-gram therefore corresponds to a single word, this measure is not applicable.
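The domainhood measures can be sketched in a few lines of Python. The sketch assumes the Church and Gale (1999) definition, in which RIDF is the observed IDF minus the IDF predicted by a Poisson model with rate tf/D. The function names `idf` and `ridf` are illustrative, and the figure of 122 documents is taken from the current size of the CCT.

```python
import math


def idf(df, n_docs):
    """Inverse document frequency: -log2(df / D)."""
    return -math.log2(df / n_docs)


def ridf(tf, df, n_docs):
    """Residual IDF: observed IDF minus the IDF expected under a Poisson model."""
    # Under Poisson with rate tf/D, a document contains the n-gram
    # with probability 1 - exp(-tf/D).
    expected_idf = -math.log2(1.0 - math.exp(-tf / n_docs))
    return idf(df, n_docs) - expected_idf


n_docs = 122
# Same total frequency, but concentrated in 2 texts versus spread over 18.
print(ridf(20, 2, n_docs))
print(ridf(20, 18, n_docs))
```

An n-gram whose occurrences cluster in few texts gets a higher RIDF than one with the same frequency spread across many texts, which is the behaviour expected of domain-specific terms.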
2.3. General vs. specific usage
The frequency of a given term should be higher in a specialized corpus than in a non-specialized one. A subcorpus of the CREA² was created to represent a non-scientific/non-technical genre. In order to get maximum variability within the genre, the 250 smallest literary book texts in the CREA were selected, yielding a 17 million-word corpus. All n-grams (with n ranging from 1 to 5) were generated and their tf computed with the algorithm described in § 3. This subcorpus acts as an exclusion filter or blank reference for determining usage. We use the Relative Frequency Ratio (RFR) between the CCT and this subcorpus of the CREA to compare n-gram usage³. This ratio is calculated by:

$$
\mathrm{RFR}(w) = \frac{f_{\mathrm{CCT}}(w)}{f_{\mathrm{CREA}}(w)}
$$

where $f_X(w)$ is the relative frequency of the n-gram $w$ in corpus $X$.
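A small Python sketch of the usage dimension follows. It assumes that the ratio places the CCT relative frequency in the numerator and the CREA relative frequency in the denominator, which is consistent with the add-one smoothing for n-grams absent from the CREA. The function names are illustrative, and the corpus sizes are the approximate figures quoted in the text.

```python
def relative_frequency(tf, corpus_size):
    """Occurrences of an n-gram divided by the size of the corpus in words."""
    return tf / corpus_size


def rfr(tf_cct, size_cct, tf_crea, size_crea):
    """Relative Frequency Ratio, assumed here to be f_CCT / f_CREA."""
    if tf_crea == 0:
        # Add-one smoothing for n-grams not represented in the CREA.
        tf_crea = 1
    return relative_frequency(tf_cct, size_cct) / relative_frequency(tf_crea, size_crea)


CCT_SIZE = 1_200_000     # approximate current size of the CCT
CREA_SIZE = 17_000_000   # approximate size of the literary CREA subcorpus

print(rfr(12, CCT_SIZE, 0, CREA_SIZE))    # absent from the CREA
print(rfr(12, CCT_SIZE, 170, CREA_SIZE))  # equally frequent in both corpora
```

Under this assumption, an n-gram that is equally frequent in both corpora has a ratio of about 1, and n-grams typical of the CCT have much larger values.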
Fig. 1 shows a sample of 400 manually classified n-grams (see § 5.1) plotted according to their relative frequency values in the CCT and in the subcorpus of the CREA. The sample is composed of 200 terms (painted black) and 200 non-terms (painted white). A diagonal line divides the plane in two. The upper region contains n-grams whose relative frequency in the CREA is greater than in the CCT, whereas the lower region contains those n-grams more frequent in the CCT. As expected, terms are located below the diagonal line.
Figure 1: Relative Frequency Ratio (terms, non-terms and the diagonal CREA = CCT)

² The CREA is a reference corpus of current Spanish containing some 130 million words and resembling the BNC in its balanced design (Martín Municio et al., 2000).
³ Frequencies for n-grams not represented in the CREA are smoothed by adding one.

3. n-gram statistics computation

Yamamoto and Church (2001) describe a fast algorithm to compute tf and df for all the substrings of a corpus. This algorithm makes use of suffix arrays and some of their properties in order to group substrings (i.e. variable-length n-grams) into equivalence classes with the same tf and df. This partition of substrings leads to a drastic reduction of the number of elements and allows, with the introduction of binary search, the computation of other statistics based on tf, such as GMI.

In order to restrict the number of n-grams extracted with the algorithm, only those with frequency over 3 and n ranging from 1 to 5 are produced. The decision to limit n to 5 was motivated by the fact that the number of entries of the same length listed in the DEC decays exponentially as length increases (see Table 1), and 99.6% of the entries have length below 6.

Table 1: Distribution of entry lengths in the DEC (columns: Len., # Entr.)
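The extraction constraints described in this section (frequency over 3, n from 1 to 5) can be illustrated with a naive counter. This is not the suffix-array algorithm of Yamamoto and Church (2001); it is a brute-force sketch that yields the same tf and df values on small inputs. The function name `tf_df` and its parameters are hypothetical.

```python
from collections import Counter


def tf_df(documents, max_n=5, min_tf=4):
    """Return {ngram: (tf, df)} for n-grams of length 1..max_n with tf >= min_tf.

    documents is a list of token lists, one per text.
    """
    tf, df = Counter(), Counter()
    for tokens in documents:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                tf[gram] += 1      # every occurrence counts towards tf
                seen.add(gram)
        df.update(seen)            # each text counts once towards df
    return {g: (tf[g], df[g]) for g in tf if tf[g] >= min_tf}


docs = [["a", "b", "a", "b"], ["a", "b", "c"], ["a", "b"]]
stats = tf_df(docs)
print(stats)
```

The brute-force version stores every n-gram explicitly, so memory grows quickly with corpus size; grouping substrings into equivalence classes, as the cited algorithm does, avoids that cost.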
4. Distribution of terms

To test how well these statistics distinguish terms from non-terms and how they are distributed along the chosen dimensions, some scatter plots have been drawn with a random sample of 200 n-grams. In the case of plots involving GMI, the sample does not include unigrams. All plots focus on the most populated regions (sometimes also excluding outliers).

As noted by Yamamoto and Church (2001) and shown in Fig. 2, GMI and RIDF do not exhibit any apparent correlation. Multiword terms are concentrated in the upper half of the plot, giving GMI more discriminative power than RIDF. It can be noted that terms and non-terms are not perfectly split. This situation gets worse when more n-grams are taken into account. Roughly the same can be said of Fig. 3, except that IDF concentrates slightly more terms on higher values.

It can be observed in Fig. 4 that n-grams with RFR far from zero and high GMI are terms. GMI can be low for those n-grams near zero frequency in the CREA.
Figure 2: GMI vs. RIDF (terms and non-terms)

Figure 3: GMI vs. IDF (terms and non-terms)

Figure 4: GMI vs. RFR

Figure 5: RIDF vs. RFR
The distributions of n-grams in Figs. 5 and 6 are not as clear as in the previous cases. These plots represent the most confused part of the whole picture. It is doubtful whether these two statistics neatly separate terms and non-terms, but plotting them serves to show that no evident correlation seems to exist for these pairs of dimensions.

Figure 6: IDF vs. RFR (terms and non-terms)

5. Learning to identify term candidates

A supervised machine learning algorithm is used for the task of classifying n-grams into terms and non-terms by combining their statistical measures. A decision tree is induced using C4.5 (Quinlan, 1993) from a classified list of n-grams. A recent similar approach is explained by Vivaldi et al. (2001), who propose a combination of different classifiers using boosting.

An initial training-set was obtained by randomly sampling the entire list of n-grams. Preliminary experiments showed that the induced trees classified non-terms better because of the unequal distribution of classes (mostly non-terms and a few terms). Precision on the training-set was only 50% for terms, against 90% of non-terms correctly classified. Thus, more terms were manually extracted from texts and added (up to 1,740) to the training-set in order to overcome this distribution problem.

Evaluation was carried out on a test-set of 14,777 examples (1,801 terms) not previously seen by the algorithm during the training phase. Experiments were run 100 times to obtain average figures. Table 2 displays the confusion matrix delivered by C4.5, where predicted terms and non-terms are represented in separate columns. As usual in evaluating ATR systems, instead of precision and recall (common in IR), we used the complementary measures noise and silence. Writing TP for terms correctly proposed, FP for non-terms proposed as terms and FN for terms missed by the system, precision is TP/(TP+FP) and recall is TP/(TP+FN). Silence is a measure of true terms not detected by the system, FN/(TP+FN), whereas noise is the measure of false terms proposed as term candidates, FP/(TP+FP).

Table 2: Confusion matrix (rows: actual class; columns: predicted class)

Parallel experimentation for two systems has been carried out: one using GMI, RIDF, RFR and n, and another using IDF instead of RIDF. Noise and silence, despite their variability, show a downward trend as the training-set grows (see Figs. 7 and 8). Slightly better results for noise are achieved with IDF, while silence gives worse results with RIDF.

The system was finally trained considering GMI, RIDF, RFR, tf and n, with all terms (3,541) and 29,000 non-terms. With this training-set distribution, both noise and silence levels, measured on the same training-set plus the rest of non-terms, were similar: noise = 31.54% and silence = 32.54%. The analysis of these errors can be found in § 6. Even though C4.5 outputs a CF (certainty coefficient) associated with the class decision, thresholds on CF have not been used because the training-set has not been considered
to have the necessary size for reliable CF calculation. In the framework of ATR, these CFs could be interpreted as a termhood index that can be used to rank the term candidates presented to the lexicographers.

Figure 7: Noise (noise with RIDF and with IDF)

Figure 8: Silence (silence with RIDF and with IDF)

5.1. Manually selected candidates vs. dictionary extracted terms

Terms automatically extracted from the DEC (an upper-intermediate level technical dictionary) provide the algorithm with only positive examples. The training set was obtained by random sampling of the list of n-grams, and then manual classification was carried out by linguists with no special background in science or terminology and no instruction on what a term is (only very polysemous n-grams, usually corresponding to single words, were to be classified as non-terms).

Surprisingly, scatter plots of manually selected candidates vs. dictionary extracted ones (Fig. 9) exhibit what we interpret as a strong correlation. Termhood judgments by linguists tend to choose n-grams with measures similar to those extracted from the DEC and found in the corpus.

Figure 9: DEC extracted terms vs. Manually selected candidates

6. Analysis of errors

We have concentrated on the analysis of noise, since it is easier to build a set of trivial filters to reduce it, rather than on the reasons for silence. The number of false terms is 1,169, most of them unigrams (790). Given that the most discriminative dimension (GMI) is missing for unigrams, filters have been tested on the remaining errors (32.4% of noise errors).

The first set of filters excludes term candidates including punctuation (79 are excluded, 20.8% of the remaining noise errors), numbers (72, 19% of the errors), a given stop word at the beginning or end of the candidate (42, 11.1%)⁴ and a given English stop word at the beginning or end of the candidate (14, 3.7%). The reduction is significant, since the remaining 379 noise errors fall to 201.

⁴ This filter is built around a couple of dozen grammatical words and is based on the observation that any term candidate must be a full (non-determinised) constituent; thus prepositions and determiners, for instance, can be neither the first nor the last element of a candidate.

The second set of filters is based on other linguistic requirements expressed as negative constraints. We could have forced the candidates to conform to a given category sequence, as many authors propose (Cabré et al., 2001), but then many potentially terminological chunks would have been lost. Thus, the constraints are conservative and are based on observations over term candidates as well as over manually selected terms. All of them try to sharpen the fuzzy border between collocation and term.

The first has to do with coordination. Frequently coordinated (IE) terms show a strong lexical relationship, a cohesive force that may better match the notion of collocation than that of termhood, as seen in the variety of relationships, which includes words in ordered series (XVI y XVII, B y C), unordered sets (forma y función, masculino y femenino) or co-hyponyms (lagartos y serpientes, ovejas y cabras). None of them are terms but, in some cases, coordinated terms. This is also true for most of the manually selected terms including a coordinating conjunction. This filter accounts for a 9% reduction (18 excluded candidates) of the new 201 error set.
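The surface filters described so far (punctuation, numbers, stop words at the edges, coordination) might be sketched as simple predicates over tokenised candidates. The word lists below are short illustrative samples, not the lists used by the authors, and all function names are hypothetical.

```python
import string

# Illustrative samples only; the paper mentions a couple of dozen grammatical words.
STOP_WORDS = {"de", "del", "el", "la", "los", "las", "un", "una", "en", "con", "por", "para"}
CONJUNCTIONS = {"y", "e", "o", "u", "ni"}


def has_punctuation(ngram):
    """True if any token contains an ASCII punctuation character."""
    return any(ch in string.punctuation for tok in ngram for ch in tok)


def has_number(ngram):
    """True if any token contains a digit."""
    return any(ch.isdigit() for tok in ngram for ch in tok)


def has_edge_stop_word(ngram, stop_words=STOP_WORDS):
    """True if the first or the last token is a grammatical word."""
    return ngram[0].lower() in stop_words or ngram[-1].lower() in stop_words


def has_coordination(ngram):
    """True if the candidate contains a coordinating conjunction."""
    return any(tok.lower() in CONJUNCTIONS for tok in ngram)


FILTERS = (has_punctuation, has_number, has_edge_stop_word, has_coordination)


def keep_candidate(ngram):
    """A candidate survives only if no negative constraint fires."""
    return not any(check(ngram) for check in FILTERS)


for cand in [("ovejas", "y", "cabras"), ("de", "la", "celula"), ("uranio", "235"), ("uso", "masivo")]:
    print(cand, keep_candidate(cand))
```

Note that `string.punctuation` includes the hyphen, so hyphenated candidates would be rejected by this sketch; a real filter would probably need a narrower character set.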
The error list also contains complex verbs, periphrases and control structures that can be easily excluded with a chunker (or even using an exclusion filter with a not so literary bias, since the structures are mainly informal). With this filter, 13 (6.5%) candidates are excluded.

Careful inspection of the rest of the noise errors unveils some errors in manual selection. This affects 29 cases (14.4%) that were considered non-terms by a linguist. This fact highlights the domain dependency of termhood. As Nunan (1993, 30) puts it: "Collocational patterns will only be perceived by someone who knows something about the subject at hand." Overall noise for n-grams (where n > 1) has been reduced to 37.2% with these simple filters. Moreover, the rest of the false terms include frequent technical collocations (uso masivo, prestigiosa revista, trabajos pioneros, niveles elevados, alta radiactividad) and common head-internal argument sequences (tiro un dado, valió el premio, expresan telomerasa, medir distancias, sufren metamorfosis).
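The evaluation measures of § 5 and the arithmetic of this section can be checked with a short Python sketch. Noise and silence are taken as the complements of precision and recall, as the text describes. The counts are those reported in the error analysis, and the reading that the 37.2% figure results from subtracting the coordination, chunk and manual-selection cases from the 201 remaining errors is an inference from the numbers, not a statement by the authors.

```python
def noise(tp, fp):
    """Share of proposed candidates that are not terms (1 - precision)."""
    return fp / (tp + fp)


def silence(tp, fn):
    """Share of true terms missed by the system (1 - recall)."""
    return fn / (tp + fn)


# Counts reported in the error analysis.
false_terms = 1169
unigram_errors = 790
remaining = false_terms - unigram_errors   # multiword noise errors
after_first_filters = 201                  # left after the first set of filters
excluded_later = 18 + 13 + 29              # coordination, chunk structures, manual-selection errors
left = after_first_filters - excluded_later

print(f"{remaining / false_terms:.1%}")    # share of noise errors that are not unigrams
print(f"{left / remaining:.1%}")           # noise left among n-grams with n > 1
```

Both printed percentages agree with the figures quoted in the text (32.4% and 37.2%).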
7. Conclusions and future work

This paper presents a language- and domain-independent methodology for ATR based on the combination of statistical measures on n-grams. Results for both noise and silence are promising given the paucity of data employed for training. The explored dimensions can be tuned by substituting the current approximations for unithood with other statistics proven useful, like the log-likelihood ratio or MI (after generalising their formulae to arbitrary-length n-grams), and for domainhood with other statistics of dispersion. Moreover, error analysis on noise reveals that a set of trivial filters significantly reduces noise, thus opening the possibility for new dimensions to be taken into account.

8. References

Ma. Teresa Cabré, Rosa Estopà, and Jordi Vivaldi. 2001. Automatic term detection: A review of current systems. In Recent Advances in Computational Terminology, volume 2 of Natural Language Processing, pages 53–87. John Benjamins.

Lee-Feng Chien and Chen Chun-Liang. 2001. Incremental extraction of domain-specific terms from online text resources. In Recent Advances in Computational Terminology, volume 2 of Natural Language Processing, page 89. John Benjamins.

Kenneth W. Church and William Gale. 1999. Inverse Document Frequency (IDF): A Measure of Deviation from Poisson. In Natural Language Processing Using Very Large Corpora, volume 11 of Text, Speech and Language Technology, pages 283–295. Kluwer Academic Publishers.

Kenneth W. Church and Patrick Hanks. 1991. Mutual Information and Lexicography. Computational Linguistics, 16(1):22–29.

Ángel Martín Municio, Guillermo Rojo, Fernando Sánchez León, and Octavio Pinillos. 2000. Language Resources Development at the Spanish Royal Academy. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), volume II, pages 1265–1270, Athens, Greece.

David Nunan. 1993. Introducing discourse analysis. Penguin.

J. R. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Karen Spärck Jones. 1973. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1):11–21.

Jordi Vivaldi, Lluís Màrquez, and Horacio Rodríguez. 2001. Improving Term Extraction by System Combination using Boosting. In Proceedings of the joint ECML-PKDD'01 Conference, Freiburg, Germany.

Mikio Yamamoto and Kenneth W. Church. 2001. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics, 27(1):1–30, March.
A Ballester, ÁM Municio, F Pardos, JP Zamorano