The proper place of men and machines in language technology: Processing Russian without any linguistic knowledge

THE PROPER PLACE OF MEN AND MACHINES IN LANGUAGE TECHNOLOGY: PROCESSING RUSSIAN WITHOUT ANY LINGUISTIC KNOWLEDGE

S. Sharoff ([email protected]), University of Leeds, UK
J. Nivre ([email protected]), Uppsala University, Sweden

The paper describes several experiments aimed at designing tools for processing Russian texts, namely for Part-of-Speech tagging, lemmatisation and syntactic parsing, exploiting exclusively statistical approaches without coding any linguistic rules specifically for Russian. While not claiming any new ground for machine learning research, the results demonstrate the possibility of creating state-of-the-art tools for Russian in a very short time using only machine learning and no hard-coded linguistic knowledge. One of the results of this study is a set of publicly available resources which can be used in standard pipelines for processing Russian. However, the experiments also demonstrate the hidden costs associated with the use of purely statistical methods and the need to integrate linguistic parameters into statistical procedures.

Key words: language technology, processing Russian texts, machine learning.

1. Introduction

The title of this paper refers to a famous research report produced by Martin Kay in the 1980s, "The proper place of men and machines in language translation", finally published as (Kay, 1997), in which Kay argued for a proper division of labour between human translators and computer-assisted translation systems. Another reference appropriate to the topic of the present paper is a statement attributed to Fred Jelinek, "Every time I fire a linguist the results of speech recognition go up", i.e.
explicit linguistic knowledge is dispensable.1 This sentiment is related to a paradigmatic shift that happened in computational linguistics at the beginning of the 1990s: with more and more data available and with advances in machine learning methods, most approaches switched from careful encoding of linguistic phenomena to finding statistical correlations in texts (either annotated or raw). The vast majority of publications at major conferences on computational linguistics belong to this paradigm. However, to the best of our knowledge, relatively few attempts have been made to apply entirely statistical methods to building tools for processing Russian, e.g., (Sokirko and Toldova, 2005; Nivre et al., 2008; Sharoff et al., 2008). Purely statistical approaches to language processing are also very infrequent in the proceedings of Russian conferences (like this one).

1 However, this story is not entirely correct, see (Jelinek, 2005).

This paper describes three experiments on designing Russian NLP tools, respectively for Part-of-Speech (POS) tagging, lemmatisation and syntactic parsing. Together they cover the basic tools needed for NLP and corpus linguistics in Russian. The experiments did not exploit any prior knowledge of the Russian language, i.e., we did not use any rules for dealing with any specific Russian phenomenon. Each experiment can be described as follows:
1. take an annotated Russian corpus;
2. design a simplified representation of the annotations to convert the corpus into a format suitable for the learning tool to be used;
3. learn a model in several iterations to tune the learning parameters.
In this approach, human effort is invested into creating annotated corpora, representing the data and designing machine learning algorithms, while the machine learns the links between the data. In the end, linguistic knowledge is induced from annotated corpora rather than explicitly hand-crafted by linguists. In a similar way, corpora can be developed without manual selection of texts from a range of sources: collection can be facilitated by crawling or by using the API of a search engine, with texts automatically annotated with respect to their domains and genres (Baroni et al., 2009; Sharoff, 2010). The automatically induced rules also do not take the form of hard constraints separating the possible from the impossible, but rather of graded constraints distinguishing the more probable from the less probable. This makes the automatically acquired models more robust to noise.
In the sections below we briefly outline the statistical methods used in each of the three tasks (Section 2), the ways of representing corpus phenomena (Section 3) and the results obtained using our tools (Section 4).

2. Methods used

2.1. Statistical part-of-speech tagging

POS tagging is aimed at assigning a POS label (tag) to each word in the input stream. Until the end of the 1980s this task was usually performed by sets of carefully crafted rules for disambiguating contexts, e.g., for detecting contexts in which an ambiguous form is a noun (`steel') or a verb (`become'), cf. one of the earliest descriptions of this sort (Nikolaeva, 1958). Ken Church was one of the first researchers to show the possibility of abandoning the rules and relying exclusively on POS-annotated data (Church, 1988). This led to a proliferation of statistical approaches to tagging, either using automatic derivation of decision trees, e.g., TreeTagger (Schmid, 1994), Hidden Markov Models (HMM), e.g., TnT (Brants, 2000), or other machine learning methods, e.g., SVMTool (Giménez and Màrquez, 2004).

Probably the most widely used approach is based on HMMs, estimating the probability of a tag from the distribution of words over tags (which tag is more likely for this word), as well as over the N-1 adjacent tags, with N often fixed at 3 (a trigram model). For example, given the Russian equivalent of `It was a steel engraving' (glossed word for word as `this was engraving on steel'), the tag sequence Noun Preposition Verb is much less likely than Noun Preposition Noun, hence the final word in this sentence receives the tag Noun. Still, the probability of the sequence Noun Preposition Verb in Russian is greater than zero because of such constructions as ...

This study uses the TnT tagger (Brants, 2000). In addition to standard HMM tagging, it employs several useful methods for approximating the probabilities of unseen tag sequences (smoothing), as well as for guessing the possible tags of unseen words. The latter is done by computing the probability of the last m characters of an unseen word form co-occurring with a given tag. For example, when forms such as English vociferation or Italian votazione are missing from the respective training corpora, they are still more likely to receive the noun tag on the basis of the POS tags of words with the same ending.

2.2. Learning lemmatisation rules

Lemmatisation rules can also be derived automatically from a list of word forms paired with their possible lemmas and POS tags obtained from an annotated corpus (Erjavec and Dzeroski, 2004; Jongejan and Dalianis, 2009). The CST lemmatiser used in our experiments tries to find, for each form-lemma pair, their longest shared part; the differing endings of the form and of the lemma then make a rewrite rule of the shape *X -> *Y, where the asterisk stands for the shared part (the Cyrillic examples are lost in this extraction). The training process then tries to apply the new rule across all pairs with the same POS tag.
If lemmatisation is successful, nothing needs to be done. However, if an applicable rule from the rule base produces an incorrect lemmatisation, i.e., its output does not match the target lemma, then a new lemmatisation rule is generated to cover the more specific case (there is a special strategy to determine which rules are retained as more general and which cover specific cases). Even if such a rule is not entirely correct linguistically, it is quite unlikely to cause problems in processing real texts, since it fires only when a form with the relevant ending receives, for instance, the tag of a comparative adjective. The training stage runs until all forms in the training set are successfully mapped to their lemmas.
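The rule-learning procedure can be illustrated with a toy sketch in the spirit of the CST lemmatiser described above. The function names and the English toy data are ours; the real tool's strategy for refining conflicting rules is more elaborate.

```python
# Toy sketch of learning suffix-rewrite lemmatisation rules:
# find the longest shared prefix of form and lemma, and store the
# remaining endings as a *X -> *Y rule.

def make_rule(form, lemma):
    """Return (form suffix, lemma suffix) after the longest shared prefix."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]

def learn(pairs):
    """Collect one rewrite rule per form suffix; the real lemmatiser
    additionally refines rules that mislemmatise other training pairs."""
    rules = {}
    for form, lemma in pairs:
        src, tgt = make_rule(form, lemma)
        rules[src] = tgt
    return rules

def lemmatise(form, rules):
    """Apply the rule with the longest matching form suffix."""
    for src in sorted(rules, key=len, reverse=True):
        if form.endswith(src):
            return form[:len(form) - len(src)] + rules[src]
    return form

rules = learn([("walked", "walk"), ("cities", "city")])
# learned rules: *ed -> *, *ies -> *y; they generalise to unseen forms
print(lemmatise("jumped", rules))
```

In the actual system the rules are learned separately per POS tag, which is what makes even over-general rules safe in practice.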
2.3. Syntactic parsing

Syntactic parsing aims at computing a complete hierarchical representation of an input sentence. Statistical methods for parsing have until recently focused on phrase structure parsing for English, resulting in a series of increasingly accurate parsers trained on the Penn Treebank (Magerman, 1995; Collins, 1997; Charniak, 2000; Charniak and Johnson, 2005). However, dependency parsing has emerged as an interesting alternative, especially for languages with more flexible word order than English, as seen in the CoNLL shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007). In fact, for decades dependency parsing has been the standard approach in the Soviet/Russian linguistic tradition (Mel'cuk, 1988).

Most recent approaches to statistical dependency parsing can be characterized as either graph-based or transition-based (McDonald and Nivre, 2007). A graph-based parser learns a model for scoring entire dependency graphs and performs exhaustive search for the highest-scoring graph at parsing time; a typical example is MSTParser (McDonald, 2006). A transition-based parser instead learns a model for predicting the next parser action (a transition) and performs greedy search for the best transition sequence at parsing time; a typical example is MaltParser (Nivre et al., 2006). Both approaches can give state-of-the-art accuracy, but the transition-based method is potentially much more efficient, which is useful when parsing large amounts of data. The transition-based MaltParser system has previously been applied to Russian with promising empirical results (Nivre et al., 2008).
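The transition-based approach can be sketched with a minimal arc-standard parser. The names are ours, and the `choose` function below is a deterministic stub for illustration; a real system such as MaltParser predicts each transition with a trained classifier over rich features of the stack and buffer.

```python
# Minimal sketch of arc-standard transition-based dependency parsing.
# Words are numbered 1..n; 0 is the artificial root.
def parse(words, choose):
    stack, buffer, arcs = [0], list(range(1, len(words) + 1)), []
    while buffer or len(stack) > 1:
        action = choose(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))       # move next word onto stack
        elif action == "LEFT-ARC" and len(stack) > 1:
            dep = stack.pop(-2)               # topmost heads second-topmost
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC" and len(stack) > 1:
            dep = stack.pop()                 # second-topmost heads topmost
            arcs.append((stack[-1], dep))
        else:
            break
    return arcs                               # (head, dependent) pairs

def right_branching(stack, buffer):
    # stub "classifier": shift everything, then attach right-branching
    return "SHIFT" if buffer else "RIGHT-ARC"

print(parse(["It", "was", "engraving"], right_branching))
```

Greedy search over such transitions is what makes the parser fast enough for billion-word corpora, at some cost in accuracy compared to exhaustive graph-based search.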
3. Russian corpora and their representation
3.1. Annotated corpora used for training
Information about the training corpora is given in Table 1. The Russian National Corpus contains a component with morphosyntactic annotation (Plungian, 2005), which is commonly known as the disambiguated corpus. Originally it contained only fiction, but it has been expanded to cover a range of genres, such as newspapers, informal communication (jokes and forums), scientific and technical texts, etc. For training the parsing tool we used SynTagRus, a Russian corpus with dependency annotation for every sentence (Boguslavsky et al., 2000). It has been produced using the output of the ETAP parser (Apresian et al., 2003) with manual correction of incorrect analyses.
Table 1. Annotated corpora used in this study
                     Tokens      Orth. words   Sentences
Disambiguated RNC    5 801 316   5 115 016     432 611
SynTagRus              719 957     635 524      41 186
3.2. Adapting the Russian tagset

Zalizniak's Grammatical Dictionary (Zalizniak, 1977) is a formalisation of Russian morphology which is commonly used in NLP tools for automatic morphological analysis, e.g., (Segalovich, 2003; Sokirko, 2004). The tagset used in the disambiguated RNC is also largely based on the Zalizniak categories (with a few expansions, such as the use of the vocative case). The problem with using statistical taggers is that they usually operate with atomic labels: in the English Penn tagset, for instance, NNS stands for `plural common noun' and NP stands for `singular proper noun'. The output of morphological analysis, in contrast, is traditionally represented as a set of features, e.g., for mystem (Segalovich, 2003) the analysis of a verb form is a feature bundle (the Cyrillic example is lost in this extraction) corresponding to `to slap=Verb,imperfective=nonpast,plural,indicative,3rd person,transitive'. It is possible to produce a tagset by concatenating the feature set for each word. However, this results in a fairly large number of tags: concatenation of the features for all words in the disambiguated RNC produces 4,592 tags, which is too many for trigram tagger learning on a corpus of five million words. The total number of tags reported in (Sokirko and Toldova, 2005), in an experiment which also used the disambiguated RNC, is 829; this indicates some kind of tagset design, though it is not described in the report. MTE is a project aimed at standardising the tagsets for a range of languages (Erjavec, 2010); it covers many other Slavonic languages, so an added advantage of using it was the possibility of creating a unified tagset. The tagset is positional, i.e., for each major POS (Noun, Verb, etc.) there are fixed positions with values for its features.
For example, Ncfsgn stands for `Noun, common, feminine, singular, genitive, inanimate', while Vmis-sfp stands for `Verb, main, indicative, past, -, singular, feminine, perfective', with the hyphen occupying the place of the person value (which is not expressed for Russian verbs in the past tense). Prepositions are marked for the case of the noun phrase they govern. The example sentence from Section 2.1 (`It was a steel engraving') receives the following analysis: P--nsnn Vmis-sfa Ncfsnn Sp-l Ncfsln.

SynTagRus is also a part of the Russian National Corpus, but because of the differences in its morphological categories it uses a separate query interface. The SynTagRus tagset has also been mapped to a subset of MTE. Given that SynTagRus does not contain the category of pronouns (its personal pronouns are coded as nouns, possessive pronouns as adjectives, etc.), its mapping to MTE produces a smaller tagset than that of the RNC. So the extra task in this case was to map the RNC-based output of the tagger to the SynTagRus-based set of tags.
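The positional encoding can be decoded mechanically. A minimal sketch, covering only the noun and verb positions illustrated by the two examples above (the slot names are ours; the full MTE specification defines tables for every POS):

```python
# Decode positional MTE-style tags into feature dictionaries.
# Position tables here follow the paper's two examples:
#   Ncfsgn  = Noun, common, feminine, singular, genitive, inanimate
#   Vmis-sfp = Verb, main, indicative, past, -, singular, feminine, perfective
SLOTS = {
    "N": ["type", "gender", "number", "case", "animacy"],
    "V": ["type", "mood", "tense", "person", "number", "gender", "aspect"],
}

def decode(tag):
    pos, values = tag[0], tag[1:]
    feats = {"pos": pos}
    for slot, value in zip(SLOTS.get(pos, []), values):
        if value != "-":          # a hyphen marks an unset value
            feats[slot] = value
    return feats

print(decode("Ncfsgn"))    # noun: common, feminine, singular, genitive, inanimate
print(decode("Vmis-sfp"))  # past-tense verb: no person feature in the output
```

Because the tags are plain strings, a statistical tagger can treat them as atomic labels while downstream tools can still recover the individual features.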
4. Results
4.1. Tagging

Out of the 5 million orthographic words of the disambiguated RNC, 10 % was kept as a held-out portion used for evaluation. The tagger was trained on the remainder of the disambiguated RNC, and the overall accuracy on the held-out portion was 95.28 % (with punctuation excluded). We also measured the performance of TnT on a reduced tagset of Russian (only the one-letter codes in Table 2). The accuracy reached 97.09 %, which is only slightly better than the performance of the tagger on the detailed tagset, while the detailed tagset is more beneficial for many NLP tasks.
Table 2. Incorrectly assigned POS tags

Code   Explanation     Error rate   Relative error   Coverage
N      Nouns           2.08 %        7.21 %          28.80 %
A      Adjectives      0.86 %        9.05 %           9.51 %
P      Pronouns        0.65 %        7.82 %           8.28 %
V      Verbs           0.50 %        4.89 %          10.16 %
C      Conjunctions    0.14 %        2.37 %           5.84 %
R      Adverbs         0.13 %        4.69 %           2.81 %
S      Prepositions    0.13 %        0.89 %          14.62 %
M      Numerals        0.13 %        4.60 %           2.81 %
Q      Particles       0.10 %        4.03 %           2.59 %
I      Interjections   0.01 %       26.42 %           0.02 %
The types of errors produced by the tagger on the full tagset are illustrated in Table 2 and Table 3. The error rate in Table 2 refers to the number of errors for a category relative to the total number of tokens; it measures how important this type of error is for tagging running text (the table is sorted by this column). It is also interesting to know the proportion of word forms within each category that are tagged incorrectly. This is the relative error rate, which reflects how difficult the category is for the tagger: e.g., the 7.21 % rate for nouns means that one out of 14 nouns gets a tag which is incorrect in at least one position, while only one out of 112 prepositions (0.89 %) gets a wrong tag (the preposition is not recognised or the case is not assigned correctly). The coverage refers to the total share of such POS tags in the held-out portion of the RNC, which indicates the relative importance of the category. The evaluation of individual categories reveals that the most difficult POS category is that of nominals, which includes adjectives and nouns, as well as pronouns, a fringe member comprising nominal pronouns (P-----n) and attributive pronouns (P-----a) with nominal inflection, as well as adverbial pronouns (P-----r). The apparently high relative error rate for interjections is explained by the fact that the two most common interjections are `a' and `o' (ambiguous with a common conjunction and a preposition respectively), and their low frequency means they do not influence the overall error rate much.
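The three columns of Table 2 can be computed from parallel gold and predicted tag sequences. A minimal sketch, with function and variable names of our choosing, where the first letter of each tag is its POS code:

```python
# Compute per-category error rate, relative error and coverage
# from parallel lists of gold and predicted tags.
from collections import Counter

def error_table(gold_tags, pred_tags):
    total = len(gold_tags)
    errs, seen = Counter(), Counter()
    for g, p in zip(gold_tags, pred_tags):
        pos = g[0]                 # first letter = POS code (N, V, S, ...)
        seen[pos] += 1
        if g != p:                 # wrong in at least one position
            errs[pos] += 1
    return {
        pos: {
            "error_rate": errs[pos] / total,          # share of all tokens
            "relative_error": errs[pos] / seen[pos],  # share within category
            "coverage": seen[pos] / total,            # category frequency
        }
        for pos in seen
    }

rows = error_table(["Ncfsgn", "Ncfsgn", "Vmis-sfp", "Sp-l"],
                   ["Ncfsgn", "Ncfsnn", "Vmis-sfp", "Sp-l"])
print(rows["N"])
```

Sorting such rows by error rate reproduces the ordering of Table 2.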
Table 3. Most common incorrectly tagged words

[The original table lists the most frequent incorrectly tagged word forms (the Cyrillic forms are lost in this extraction), each with its error frequency (ranging from 0.0932 % down to 0.0181 %), an indication of whether the listed tag comes from the TnT output or from the RNC gold standard, and the tag itself, e.g. C, P-----r, Ncfsgn, P--nsnn, Q.]
A more detailed look at the sources of errors presented in Table 3 reveals the following problems:
1. distinguishing between closely related POS classes, such as pronouns and conjunctions, and similarly for particles;
2. dealing with long-distance dependencies, especially in distinguishing between the nominative and accusative cases;
3. domain mismatch, when the training corpus and the held-out one refer to different domains (e.g., forms whose gender or animacy differs between domains);
4. guessing the full tag for abbreviations (one abbreviation was plural genitive in the held-out portion of the RNC, but got the tag of singular genitive in the absence of other indicators of plurality);
5. distinguishing between adverbs and short adjectives.
In spite of these problems in statistical tagging, a recent comparison of several Russian disambiguation tools (Ljashevskaja et al., 2010) demonstrated its reasonable performance against other disambiguation and lemmatisation tools (our tagger and lemmatiser are reported there under the names of Peru and Pine). The accuracy of POS tagging achieved on that corpus was 97.3 %, considerably better than the majority of the other (rule-based) systems. Moreover, the worst performing component of our tagger was the rule-based tokeniser, which incorrectly identified token boundaries and thus decreased the overall performance.
4.2. Lemmatisation
As an illustration, consider the rules generated for the tag Ncmsgy, i.e., animate masculine nouns in the genitive case (the Cyrillic rule strings are lost in this extraction). The model for Zalizniak's Index 5 (a class of masculine nouns) is well represented, including the regular forms with and without morphological alternation, as well as some exceptions: an irregular form, an occasional form used in Vasily Grossman's "Life and Fate", and a spurious rule which arose from the inability of the lemmatiser to deal with hyphenated nouns.
The statistical lemmatiser depends on the output of tagging, but it is moderately tolerant of tagger errors. For example, in spite of the error in detecting the animacy of one of the nouns in Table 3, it still gets the right lemma. However, the error in detecting the gender of another form leads to incorrect lemmatisation.
Table 4. Parsing results on development set of SynTagRus; labeled attachment score (LAS) and unlabeled attachment score (UAS)
Setting                    LAS     UAS
SynTagRus tags, poly-SVM   83.4    89.4
MTE tags, poly-SVM         82.8    88.8
MTE tags, linear SVM       82.2    88.0
4.3. Syntactic parsing

Because of the need to tune parameters, SynTagRus was split into three parts: the training set (507 986 words), the development set for tuning the parameters (64 196 words) and the test set for the final evaluation (63 342 words). Table 4 shows results on the development set for three different settings with the standard evaluation metrics: labeled attachment score (LAS), the proportion of words that are assigned both the correct head and the correct dependency label, and unlabeled attachment score (UAS), the proportion of words that are assigned the correct head (regardless of label).

The first experiment replicates the settings from (Nivre et al., 2008) exactly, using the original part-of-speech tags from the SynTagRus treebank and using SVMs with a polynomial kernel to predict the next parser transition.2 The results obtained are slightly better than the ones reported by (Nivre et al., 2008) (LAS 82.3, UAS 89.0), which is probably due to the larger training set. The second experiment uses the same features and the same type of classifier (poly-SVM) but replaces the SynTagRus part-of-speech tags with the MTE tags. This results in slightly lower parsing accuracy, about 0.6 percentage points for both metrics.

Using SVMs with a polynomial kernel is rather inefficient during both training and parsing. For example, parsing the development set of 68 314 tokens takes about three hours. In the third experiment we therefore used a linear SVM, together with a slightly extended set of features to compensate for the lack of the polynomial kernel. The result is a much faster parser, which parses the development set in under two minutes, although with slightly lower accuracy. This parsing model will be applied to the Russian Web corpus of about 3 billion words, and it is expected to complete parsing in under two months.
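The two metrics are straightforward to compute. A minimal sketch (the names are ours), where each analysis assigns every word a (head, label) pair:

```python
# Labeled (LAS) and unlabeled (UAS) attachment scores for a sentence:
# LAS counts words with the correct head AND label, UAS only the head.
def attachment_scores(gold, predicted):
    assert len(gold) == len(predicted)
    n = len(gold)
    las = sum(1 for g, p in zip(gold, predicted) if g == p)
    uas = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    return las / n, uas / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "amod"), (0, "root"), (2, "obj")]   # one wrong label
print(attachment_scores(gold, pred))
```

Over a corpus, the counts are accumulated across all sentences before dividing, which is what the CoNLL evaluation does.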
2 Besides part-of-speech tags, the parser uses word forms, lemmas and morphosyntactic features as a basis for prediction; see (Nivre et al., 2008) for more details.
5. Conclusions

This paper presents a fairly radical stance: it is redundant to encode linguistic knowledge explicitly; a completely automatic machine learning procedure can quickly produce a fast and reliable NLP component which rivals (and in some cases exceeds) the performance of hard-coded linguistic rules requiring the efforts of many person-months (if not years). Hence, the efforts of linguists are better spent on creating data rather than writing rules.

Nevertheless, this claim needs to be taken with a pinch of salt. First, the approach was reasonably successful because it implicitly utilised some information about the language. The methods for unknown-word guessing as well as lemmatisation used in this study rely on the fact that Russian is a flective language. Statistical tagging and lemmatisation are known to be more difficult for agglutinative languages like Turkish (Dincer et al., 2008). For an isolating language like Chinese there is no problem with lemmatisation, but the greater average ambiguity of the POS tags for known words and the lack of reliable prediction of the POS tag for unknown words make the accuracy of knowledge-free methods considerably lower.

Second, data representation in terms of tag labelling is simple and efficient, but a tag label lacks information about the internal structure of linguistic phenomena. For example, when the system learns the structure of Russian noun phrases, it does not take into account agreement in case, number and gender. It only learns the fact that Afpmsg is normally followed by Ncmsgn, Ncmsgy or Npmsgy, while Afpfsd is followed by Ncfsdn, etc. However, if the set of training examples does not contain a proper masculine inanimate noun (Npmsgn) in this sequence, the tagger will fail to treat the sequence Afpmsg Npmsgn as a noun phrase, even though the concept of animacy is not relevant to noun phrase construction.
Yet another problem in using purely statistical methods is the reliance on the patterns present in the training data. Each training set has its own peculiarities, which do not necessarily match the peculiarities of the application domain. For example, the impressive accuracy of 97-98 % for HMM tagging is obtained on well-controlled newspaper texts (the Wall Street Journal for English and the Frankfurter Rundschau for German), but the accuracy of taggers trained on these corpora drops dramatically on other text genres, down to 85.7 % on Internet forums, i.e., every seventh word is tagged incorrectly (Giesbrecht and Evert, 2009). This does not indicate any inferior status of Internet forums, just the fact that a trigram model trained on newspaper texts does not approximate them well. Annotating texts in the application domain to obtain more training data is expensive, so tools are often used in new domains without formal evaluation of their accuracy; e.g., ukWaC (Baroni et al., 2009) has been tagged and lemmatised with the default TreeTagger model. This problem is partly addressed by new approaches to machine learning based on domain adaptation, which use a training corpus from the source domain (with available annotated data), a small number of annotated examples from the target domain and a large number of unlabelled examples from the target domain (Daumé III et al., 2010).
In addition to this known problem of unknowns under domain mismatch, there is a problem of unknown knowns, namely when peculiarities inherent in the annotated set are not obvious, while machine learning is likely to exploit them for making classification decisions. In the end, the system might achieve reasonably good accuracy on the held-out portion of the annotated set (since it is drawn from the same distribution), while this accuracy could be irrelevant outside of the annotated set. For example, in the field of automatic genre classification it has been shown that a large number of texts on a particular topic within a genre heading can considerably affect the decisions made by the classifier, e.g., by treating texts on hurricanes and taxation as belonging to FAQs (Wu et al., 2010). At the same time, a classifier based on POS trigrams is much less accurate overall, but it suffers less from transfer from one annotation set to another (Petrenz and Webber, 2010).

Finally, there are problems with correcting the results. An error produced by a rule-based tagger can be corrected by debugging: finding the incorrectly fired rule, modifying it and testing the performance again. A statistical model can only be amended by modifying the learning parameters or by providing more data, which is only indirectly related to the performance of the system on an individual problem.

With these caveats in place, the main contribution of the paper is two-fold. First, we describe a baseline for natural language processing for Russian using only statistical methods and minimal adjustment to the representation of the source data. In spite of its minimalism, the baseline outperforms the majority of the rule-based systems (Ljashevskaja et al., 2010).
Second, the tools reported in this paper are available for linguistic research.3 This provides the entire pipeline, which starts with POS tagging of pre-tokenised texts, proceeds to lemmatisation and ends with syntactic parsing.

Acknowledgements

Research reported in this paper was partly funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 248 005 (TTC)4 and partly by the European Community's Lifelong Learning Programme (project Kelly, Keywords for Language Learning for Young and Adults Alike).5
3 They can be downloaded from http://corpus.leeds.ac.uk/tools 4 http://www.ttc-project.eu 5 http://su.avedas.com/converis/contract/321
References

1. Apresian J., Boguslavskii I., Iomdin L., Lazurskii A., Sannikov V., Sizov V., Tsinman L. 2003. ETAP-3 Linguistic Processor: a Full-fledged NLP Implementation of the MTT. First International Conference on Meaning-Text Theory: 279-288.
2. Baroni M., Bernardini S., Ferraresi A., Zanchetta E. 2009. The WaCky Wide Web: a Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation, 43(3): 209-226.
3. Boguslavskii I., Grigor'eva S., Grigor'ev N., Kreidlin L., Frid N. 2000. Dependency Treebank for Russian: Concept, Tools, Types of Information, 2: 987-991.
4. Brants T. 2000. TnT -- a Statistical Part-of-Speech Tagger. Proceedings of the 6th Applied Natural Language Processing Conference: 224-231.
5. Buchholz S., Marsi E. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing. Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL): 149-164.
6. Charniak E. 2000. A Maximum-Entropy-Inspired Parser. Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL): 132-139.
7. Charniak E., Johnson M. 2005. Coarse-to-fine N-best Parsing and MaxEnt Discriminative Reranking. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL): 173-180.
8. Church K. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. Proceedings of the Second Conference on Applied Natural Language Processing: 136-143.
9. Collins M. 1997. Three Generative, Lexicalised Models for Statistical Parsing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL) and the 8th Conference of the European Chapter of the Association for Computational Linguistics (EACL): 16-23.
10. Daumé III H., Kumar A., Saha A. 2010. Frustratingly Easy Semi-Supervised Domain Adaptation. Workshop on Domain Adaptation for Natural Language Processing at ACL 2010.
11. Dincer T., Karaoglan B., Kisla T. 2008. A Suffix Based Part-of-Speech Tagger for Turkish. Third International Conference on Information Technology: New Generations: 680-685.
12. Erjavec T. 2010. MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10).
13. Erjavec T., Dzeroski S. 2004. Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words. Applied Artificial Intelligence, 18(1): 17-41.
14. Giesbrecht E., Evert S. 2009. Part-of-Speech (POS) Tagging -- a Solved Task? An Evaluation of POS Taggers for the Web as Corpus. Proceedings of the Fifth Web as Corpus Workshop (WAC5): 27-35.
15. Giménez J., Màrquez L. 2004. SVMTool: A General POS Tagger Generator Based on Support Vector Machines. Proceedings of the Fourth Language Resources and Evaluation Conference.
16. Jelinek F. 2005. Some of My Best Friends are Linguists. Language Resources and Evaluation, 39(1): 25-34.
17. Jongejan B., Dalianis H. 2009. Automatic Training of Lemmatization Rules that Handle Morphological Changes in Pre-, In- and Suffixes Alike. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP.
18. Kay M. 1997. The Proper Place of Men and Machines in Language Translation. Machine Translation, 12(1-2): 3-23.
19. Liashevskaia O., Astaf'eva I., Bonch-Osmolovskaia A., Gareishina A., Iu. G., D'iachkov V., Ionov M., Koroleva A., Kudrinski M., Litiagina A., Luchina E., Sidorova E., Toldova S., Savchuk S., Koval' S. 2010. Evaluation of Automatic Text Parsing Methods: Morphological Parsers of Russian [Otsenka Metodov Avtomaticheskogo Analiza Teksta: Morfologicheskie Parsery Russkogo Iazyka]. Komp'iuternaia Lingvistika i Intellektual'nye Tekhnologii: Trudy Mezhdunarodnoi Konferentsii "Dialog 2010" (Computational Linguistics and Intelligent Technologies: Proceedings of the International Conference "Dialog 2010"): 318-326.
20. Magerman D. M. 1995. Statistical Decision-tree Models for Parsing. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL): 276-283.
21. McDonald R. 2006. Discriminative Learning and Spanning Tree Algorithms for Dependency Parsing. PhD thesis, University of Pennsylvania.
22. McDonald R., Nivre J. 2007. Characterizing the Errors of Data-driven Dependency Parsing Models. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL): 122-131.
23. Mel'chuk I. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.
24. Nikolaeva T. 1958. Soviet Developments in Machine Translation: Russian Sentence Analysis. Mechanical Translation, 5(2): 51-59.
25. Nivre J., Boguslavskii I. M., Iomdin L. L. 2008. Parsing the SynTagRus Treebank of Russian. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008): 641-648.
26. Nivre J., Hall J., Kübler S., McDonald R., Nilsson J., Riedel S., Yuret D. 2007. The CoNLL 2007 Shared Task on Dependency Parsing. Proceedings of the CoNLL Shared Task of EMNLP-CoNLL 2007: 915-932.
27. Nivre J., Hall J., Nilsson J. 2006. MaltParser: A Data-driven Parser-Generator for Dependency Parsing. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC): 2216-2219.
28. Petrenz P., Webber B. 2010. Stable Classification of Text Genres. Computational Linguistics, 34(4).
29. Plungian V. A. 2005. What do We Need the Russian National Corpus for? [Zachem Nuzhen Natsionalnyi Korpus Russkogo Iazyka?]. Natsionalnyi Korpus Russkogo Iazyka: 6-20.
30. Schmid H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. International Conference on New Methods in Language Processing.
31. Segalovich I. 2003. A Fast Morphological Algorithm with Unknown Word Guessing Induced by a Dictionary for a Web Search Engine. Proceedings of MLMTA-2003.
32. Sharoff S. 2010. In the Garden and in the Jungle: Comparing Genres in the BNC and Internet. Genres on the Web: Computational Models and Empirical Studies.
33. Sharoff S., Kopotev M., Erjavec T., Feldman A., Divjak D. 2008. Designing and Evaluating a Russian Tagset. Proceedings of the Sixth Language Resources and Evaluation Conference (LREC 2008).
34. Sokirko A. 2004. Morphological Modules on the Website www.aot.ru [Morfologicheskie Moduli na Saite www.aot.ru]. Komp'iuternaia Lingvistika i Intellektual'nye Tekhnologii: Trudy Mezhdunarodnoi Konferentsii "Dialog 2004" (Computational Linguistics and Intelligent Technologies: Proceedings of the International Conference "Dialog 2004").
35. Sokirko A., Toldova S. 2005. Sravnenie Effektivnosti Dvukh Metodik Sniatiia Leksicheskoi i Morfologicheskoi Neodnoznachnosti dlia Russkogo Iazyka (A Comparison of Two Methods of Lexical and Morphological Disambiguation for Russian). Internet-matematika.
36. Wu Z., Markert K., Sharoff S. 2010. Fine-grained Genre Classification Using Structural Learning Algorithms. Proceedings of ACL 2010.
37. Zalizniak A. 1977. Grammatical Dictionary of the Russian Language [Grammaticheskii Slovar' Russkogo Iazyka]. Russkii Iazyk.
