From: FLAIRS-02 Proceedings. Copyright © 2002, AAAI (www.aaai.org). All rights reserved.
Text Mining for Causal Relations
Roxana Girju and Dan Moldovan
Computer Science Department, University of Texas at Dallas, Richardson, Texas
[email protected], [email protected]
Abstract
Given a semantic relation, the automatic extraction of linguistic patterns that express that relation is a rather difficult problem. This paper presents a semi-automatic method of discovering generally applicable lexico-syntactic patterns that refer to the causal relation. The patterns are found automatically, but their validation is done semi-automatically.

Introduction
The automatic identification of semantic relations in text has become increasingly important in Information Extraction, Question Answering, and Information Retrieval in the last decade. The MUC competitions have made a significant contribution to AI, as many Information Extraction systems used new and innovative techniques to discover relevant information in texts. In order to extract the exact answer to user queries, Q&A and IR systems need to synthesize information gathered from multiple documents, or to identify new relationships between facts or entities and discover new knowledge. An important semantic relation for all these applications is the causal relation. Although many researchers have focused their attention on this semantic relation, they used hand-coded patterns to extract causation information from text.

This paper is part of a project to automatically discover knowledge from texts. In addition to concepts, the knowledge consists of relationships that express various semantic relations between concepts (e.g., CAUSATION, INFLUENCE, PART-WHOLE). In this paper we focus only on the causal relation and show a method for the automatic detection of causation patterns and a semi-automatic validation of ambiguous lexico-syntactic patterns referring to causation. In the next sections we review previous work on causality and describe our approach. We then present results and close with some discussion and conclusions.

Previous Work in Computational Linguistics
Broadly speaking, causality refers to the way of knowing whether one state of affairs causes another. Although the notion of causality is very old (beginning with Aristotle's Metaphysics), over time it has been surrounded by controversy, as scientists and philosophers have not agreed on the definition of causality or on when two states of affairs are causally linked.

The theory of causality is very broad, and perhaps the most interesting feature of the work on causation in the last decades has been its diversity. Several theories have been developed, resulting in an overwhelming number of publications. This explosion of approaches can be explained in part by the plurality of perspectives the researchers used, and by the diversity of domains to which the notion of causation applies: philosophy, statistics, linguistics, physics, economics, biology, medicine, etc.

In Computational Linguistics, many previous studies have attempted to extract implicit inter-sentential cause-effect relations from text using knowledge-based inferences (Joskowicz, Ksiezyk and Grishman 1989), (Kaplan 1991). These studies were based on hand-coded, domain-specific knowledge bases that are difficult to scale up for realistic applications. Other researchers (Garcia 1997), (Khoo et al. 2000) used linguistic patterns to identify explicitly expressed causal relations in text without any knowledge-based inference. Garcia focused on the extraction of causal relations from French texts, using hand-coded lexico-syntactic patterns, and reported a precision of 85% (Garcia 1997). Khoo et al. (Khoo et al. 2000) extracted English linguistic patterns by hand from a medical database, reporting an accuracy of about 68%.

The Approach
The algorithm for the detection of lexico-syntactic patterns that refer to causation consists of two major procedures. The first procedure discovers lexico-syntactic patterns that can express the causal relation, and the second procedure validates and ranks the ambiguous patterns acquired, based on semantic constraints on nouns and verbs.

Automatic discovery of lexico-syntactic patterns referring to causation
The causal relation can be expressed in text in various ways, from explicit to implicit, and from intra- to extra-sentential
patterns. One of the most frequent explicit intra-sentential patterns that can express causation is NP1 verb NP2. According to two Russian linguists (Nedjalkov and Silnickij 1973) who carried out multilingual studies of causation, causation verbs can be classified into the following categories:
1. Simple causatives: the linking verb refers only to the causal link, most of the time being synonymous with cause. For example, in "Earthquakes generate tidal waves" the verb "generate" is synonymous with "cause".
2. Resultative causatives: the linking verb refers to the causal link plus a part of the resulting situation. E.g., kill (cause to die), melt, dry, break, drop, etc.
3. Instrumental causatives: the linking verb expresses a part of the causing event as well as the result. E.g., poison (killing by poisoning), hang, punch, clean, etc.
In this paper we focus on explicit intra-sentential syntactic patterns of the form NP1 verb NP2, where the verb is a simple causative.

In order to capture the most frequently used lexico-syntactic patterns referring to causation, we used a modified version of Hearst's procedure (Hearst 1998), as described below.

Procedure 1. Discovery of lexico-syntactic patterns:
1. Pick a semantic relation R (e.g., CAUSATION).
2. Pick a pair of noun phrases among which R holds.
In order to get as many causation patterns as possible, we repeated step 2 for a list of noun phrases extracted from WordNet 1.7. WordNet (Miller 1995) contains 17 semantic relations: IS-A, reverse IS-A, MERONYMY/HOLONYMY, ENTAIL, CAUSE-TO, ATTRIBUTE, PERTAINYMY, ANTONYMY, SYNSET (SYNONYMY), etc. The CAUSE-TO relation is a transitive relation between verb synsets. For example, in WordNet the second sense of the verb develop is "causes to grow". Given that almost all these verbs have nominalizations, it is easy to find noun concepts among which the WordNet causal relations hold. Although WordNet contains numerous causal relationships between nouns that are always true, they are not stated explicitly. One way to determine such relationships is to look for all patterns that occur between a noun entry and another noun in the corresponding gloss definition. One such example is the causal relationship between bonyness and starvation. The gloss of bonyness (#1/1) is (extreme leanness (usually caused by starvation or disease)). WordNet 1.7 contains 429 such relations linking nouns from different domains, the most frequent being medicine (about 58.28%).
3. Extract lexico-syntactic patterns that link the two selected noun phrases by searching a collection of texts.
For each pair of causation nouns determined above, search the Internet or any other collection of documents. Retain only the sentences containing the pair. From these sentences, automatically determine all the patterns NP1 verb/verb expression NP2, where NP1 - NP2 is the pair considered.

The result is a list of verbs and verb expressions that refer to causation. Some of these verbs always refer to causation, but most of them are ambiguous, in the sense that they express a causal relation only in a particular context and only between specific pairs of nouns. For example, NP1 causes NP2 always refers to causation, but this is not true for NP1 produces NP2. In most cases the verb produce has the sense of manufacture, but in some particular contexts it refers to causation.

In her procedure, Hearst selects the patterns by hand and applies them to text without any semantic filtering of the relationships obtained. In our approach, the acquisition of linguistic patterns is done automatically, as the pattern is predefined (NP1 verb NP2). As described in the next subsection, the relationships are disambiguated and ranked, and only those referring to causation are retained.
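Step 3 of Procedure 1 can be illustrated with a minimal sketch. It is not the system described in this paper: it assumes plain regular-expression matching instead of part-of-speech tagging and parsing, treats any one to four words between the two nouns as the candidate verb expression, and does no lemmatization. The function name `extract_patterns` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

def extract_patterns(sentences, pairs):
    """Count the word sequences that link NP1 to NP2 in sentences containing both."""
    counts = Counter()
    for np1, np2 in pairs:
        # NP1 (optionally inflected), then one to four linking words, then NP2.
        regex = re.compile(
            r"\b" + re.escape(np1) + r"\w*\s+((?:\w+\s+){1,4}?)" + re.escape(np2),
            re.IGNORECASE,
        )
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                # The captured group is the candidate verb / verb expression.
                counts[match.group(1).strip().lower()] += 1
    return counts

# Hypothetical input: two causation pairs and a tiny "collection of texts".
PAIRS = [("starvation", "bonyness"), ("earthquake", "tidal wave")]
SENTENCES = [
    "Starvation causes bonyness in severe cases.",
    "Prolonged starvation leads to bonyness.",
    "Earthquakes generate tidal waves.",
    "The report mentions starvation.",  # contains only one noun of a pair: ignored
]

patterns = extract_patterns(SENTENCES, PAIRS)
print(patterns)
```

The counter keeps inflected forms ("causes", "leads to") as they appear; a fuller implementation would reduce them to base forms and restrict the linking words to verbs before treating them as causation patterns.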
Validation of causation patterns and ranking of causation relationships
Because the exact disambiguation of the verb sense is often very difficult, we validate the lexico-syntactic patterns using a coarse-grained approach. The approach consists of detecting the necessary and sufficient constraints on the nouns and the verb of the pattern NP1 verb NP2 such that the lexico-syntactic pattern indicates a causal relationship.
Semantic constraints on nouns
The basic idea we employ here is that only some categories of noun phrases can be associated with a causation link. According to the philosophy researcher Jaegwon Kim (Kim 1993), any discussion of causation implies an ontological framework of entities among which causal relations are to hold, and also "an accompanying logical and semantical framework in which these entities can be talked about". He argues that the entities that represent either causes or effects are often events, but also conditions, states, phenomena, processes, and sometimes even facts, and that coherent causal talk is possible only within a coherent ontological framework of such states of affairs.
In a relationship of the form NP1 verb NP2, the nouns NP1 (cause noun) and NP2 (effect noun) can express explicit or implicit states of affairs. The following four situations can occur:
1. The cause noun and the effect noun are both explicit states of affairs. E.g., Earthquakes cause tidal waves.
2. The effect noun expresses an explicit state of affairs, and the cause noun an implicit one. E.g., John caused the disturbance.
3. The cause noun expresses an explicit state of affairs, and the effect noun an implicit one. E.g., Sometimes rain can cause you bad days.
4. The cause noun and the effect noun are both implicit states of affairs. E.g., John caused her really bad days.
Examples 2 and 4 denote a causal relationship, as the verb caused indicates, but the relation is not explicit. John cannot directly cause a psychological state (e.g., the disturbance); rather, the action John undertook caused it. In this paper we focus only on situations 1 and 2, as they are the most frequent in texts. Given this approach, the system automatically selects the causation classes with the following procedure:
STEP 1. Semantic constraints on the EFFECT noun (NP2)
For each noun occupying the EFFECT position in the causation pairs detected in step 1 of Procedure 1, select as causation class the most general subsumer in WordNet for that given sense. For example, the most general subsumer of the word excitement (#1/4) in WordNet is psychological feature. In WordNet, all the EFFECT nouns in the causation pairs represent entities that express explicit states of affairs. At the end of this step, the system detected the following causation classes: human action, phenomenon, state, psychological feature, and event. Our assumption is that these classes represent causation categories, and that anything not in this list refers to noncausation.
STEP 2. Semantic constraints on the CAUSE noun (NP1)
We noticed from the corpus created in Procedure 1 that metonymy occurs with high frequency in causal relationships, but mostly in the CAUSE position and quite rarely in the EFFECT position. This observation is also supported by the large number of classes obtained for the nouns in the CAUSE position with the procedure described above. This shows that the CAUSE can be represented by almost any noun. Thus, we use here only a soft constraint, which helps validate the relationships in some special cases explained later in section 4:
Soft constraint on CAUSE: the noun should have as subsumer the concept causal agent in WordNet. For example, the second most general subsumer of the word drug in WordNet is causal agent.
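The two noun constraints above can be sketched as a walk up a hypernym hierarchy. This is only an illustration under stated assumptions: `TOY_HYPERNYMS` is a hypothetical hand-made table standing in for WordNet 1.7 (its intermediate nodes are invented for the example), and the helper names `subsumers`, `most_general_subsumer`, `is_causation_class` and `has_causal_agent` are not from the paper.

```python
# The five causation classes listed in STEP 1.
CAUSATION_CLASSES = {
    "human action", "phenomenon", "state", "psychological feature", "event",
}

# Hypothetical toy hierarchy (child -> parent); a real system would query WordNet.
TOY_HYPERNYMS = {
    "excitement": "feeling",
    "feeling": "psychological feature",
    "drug": "agent",
    "agent": "causal agent",
    "causal agent": "entity",
    "enterprise": "organization",
    "organization": "group, grouping",
}

def subsumers(concept, hypernyms=TOY_HYPERNYMS):
    """Return the ancestors of a concept, nearest first."""
    chain = []
    # Stop at the top of the hierarchy, or if a cycle would be entered.
    while concept in hypernyms and hypernyms[concept] not in chain:
        concept = hypernyms[concept]
        chain.append(concept)
    return chain

def most_general_subsumer(concept, hypernyms=TOY_HYPERNYMS):
    """Return the topmost ancestor, or the concept itself if it has none."""
    chain = subsumers(concept, hypernyms)
    return chain[-1] if chain else concept

def is_causation_class(concept, hypernyms=TOY_HYPERNYMS):
    """Hard constraint on EFFECT: the top-level class must be a causation class."""
    return most_general_subsumer(concept, hypernyms) in CAUSATION_CLASSES

def has_causal_agent(concept, hypernyms=TOY_HYPERNYMS):
    """Soft constraint on CAUSE: 'causal agent' appears among the subsumers."""
    return "causal agent" in subsumers(concept, hypernyms)

print(most_general_subsumer("excitement"))
print(is_causation_class("enterprise"), has_causal_agent("drug"))
```

In this sketch a noun sense is represented by a single node; handling polysemous nouns would mean running the same checks once per sense.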
Semantic constraints on verbs
We ranked the verbs and verb expressions extracted in step 3 of Procedure 1 based on their ambiguity and frequency levels in WordNet. In WordNet, verbs are represented in synsets, which are lists of synonyms for that verb, and each verb can have multiple senses. For a given verb, the senses in WordNet 1.7 are ranked by the number of times each sense occurs in the semantically tagged corpus used by the WordNet lexicographers. Based on our observations of the extracted verbs in WordNet, we considered the following categories of constraints along with their thresholds:
1. low ambiguity: the number of senses of the verb considered is below the ambiguity threshold;
2. high ambiguity: the number of senses of the verb considered is above the ambiguity threshold;
3. low frequency: the frequency of that particular sense is below the sum of the frequencies of all other senses, or below 30;
4. high frequency: the frequency of that particular sense is above the sum of the frequencies of all other senses, or above 30.

Table 1 shows some of the verbs extracted with Procedure 1, ranked according to the constraints defined above. For example, the verb make is ranked last because it is highly ambiguous (it has 49 senses in WordNet 1.7) and occurs with high frequency (79 occurrences in the WordNet tagged corpus). Thus, the sentence "Greenspan makes a recession" is highly ambiguous, as it can be interpreted in two ways: either (1) as a causal relation, if recession has sense #1/4 (the state of the economy declines), or (2) as a noncausal relation, if recession has sense #2/4 (a small concavity).
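The verb categories can be sketched as a small classification function. Several points are assumptions rather than the paper's definitions: the numeric ambiguity threshold is not given in the text, so `ambiguity_threshold=10` is a purely hypothetical placeholder; the direction of each comparison is inferred from the category names; and a verb that is not "high frequency" is simply treated as "low frequency". The ordering in `RANK_ORDER` follows the order of the groups in Table 1.

```python
# Order of the groups in Table 1, from most to least reliable (index = rank).
RANK_ORDER = [("low", "high"), ("low", "low"), ("high", "low"), ("high", "high")]

def verb_category(num_senses, sense_freq, other_senses_freq,
                  ambiguity_threshold=10, freq_threshold=30):
    """Return (ambiguity, frequency) labels for one verb sense.

    ambiguity_threshold is a hypothetical value; only the frequency
    threshold of 30 appears in the text.
    """
    ambiguity = "high" if num_senses > ambiguity_threshold else "low"
    # Assumed reading: a sense is frequent if it dominates the other senses
    # or occurs more than freq_threshold times in the tagged corpus.
    is_frequent = sense_freq > other_senses_freq or sense_freq > freq_threshold
    frequency = "high" if is_frequent else "low"
    return ambiguity, frequency

def verb_rank(category):
    """Position of a category in the Table 1 ordering (0 = least ambiguous)."""
    return RANK_ORDER.index(category)

# "make": 49 senses and 79 occurrences are from the text; the count for the
# other senses (500) is an invented figure for the example.
make_category = verb_category(num_senses=49, sense_freq=79, other_senses_freq=500)
print(make_category, verb_rank(make_category))
```

With these assumptions, make lands in the last group, which matches its position at the end of Table 1.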
The algorithm for the validation and ranking of the causal relationships is an iterative procedure in which a step is followed only if the condition in the previous step was not satisfied. In this algorithm we consider the head noun of each extracted noun phrase, as it occurs in WordNet (e.g., for the noun phrase "giant tidal wave", tidal wave is automatically selected).
Step 1. If the EFFECT and CAUSE head nouns are monosemous and they belong to one of the causation classes, or are polysemous and all their senses belong to the causation classes, then classify the relationship as causation of rank 1.
For example, "Hitler's invasion of Poland provoked the Second World War". Here, both invasion and Second World War have all their senses in causation classes, so even though the verb provoke is ambiguous, the relationship is detected as causation.
Step 2. If the EFFECT head noun is monosemous and it belongs to one of the causation classes, or is polysemous and all its senses belong to the causation classes, then classify the relationship as causation of rank 2.
For example, "In 1958, it was Bleustein-Blanchet who sparked a controversy when he opened Le Drugstore, the American-inspired combination pharmacy, all-hours restaurant and gift store that now has branches at both ends of the avenue". Here, the causal relation is obvious, as controversy is monosemous and its sense has the semantic class human action.
Step 3. If the EFFECT is represented by an enumeration of noun phrases and the head noun of at least one of them has all its senses in one of the causation classes, then the others also refer to causation in that context. Classify the relationship as causation of rank 3.
For example, in the sentence "Fed will induce a recession and unemployment" the effect unemployment is monosemous and belongs to the causation class state. Thus, the effect noun recession is disambiguated and its interpretation as sense #2 (niche, corner) is eliminated.

Low ambiguity,      Low ambiguity,      High ambiguity,     High ambiguity,
high frequency      low frequency       low frequency       high frequency
induce              stir up             create              start
give rise (to)      entail              launch              make
produce             contribute to       develop             begin
generate            set up              bring               rise
effect              trigger off
bring about         commence
provoke             set off
arouse              set in motion
elicit              bring on
lead (to)           conduce to
trigger             educe
derive (from)       originate in
associate (with)    lead off
relate (to)         spark
link (to)           spark off
stem (from)         evoke
originate           link up
bring forth         implicate (in)
lead up             activate
trigger off         actuate
bring on            kindle
result (from)       fire up
                    stimulate
                    call forth
                    unleash
                    effectuate
                    kick up
                    give birth (to)
                    call down
                    put forward
Table 1: Ambiguous causation verbs ranked based on ambiguity and frequency. The ambiguity increases from the leftmost column to the right.

Step 4. If the noun phrase representing the EFFECT is ambiguous (at least one of its senses does not belong to a causation class) and the CAUSE respects the soft constraint defined in the previous section, then classify the relationship as causation of rank 4.
For example, in the sentence "The drugs induce the growth of muscle tones", the head noun growth has two senses (#4/7 and #7/7) that are in two noncausation classes (group, grouping and entity, respectively). In this case, the noun drugs disambiguates the relationship, as it is monosemous and has causal agent as one of its hypernyms.
Step 5. At this point, the remaining nouns representing the CAUSE and EFFECT
are ambiguous, and the only possibility of disambiguation comes from the restrictions imposed on the verbs. For example, in the sentence "The issue gives rise to a big concern", both the CAUSE and the EFFECT are ambiguous. The noun issue can be "an important question that is in dispute and must be settled" (psychological feature, cf. WordNet), or "one of a series published periodically" (entity, cf. WordNet). The noun concern can refer to an anxious feeling (psychological feature, cf. WordNet), or to a commercial or industrial enterprise (group, grouping). In this case the relationship is considered causation only because the verb give rise is one of the less ambiguous and highly frequent verbs considered. For all the remaining relationships, classify them based on the verbs' ranking shown in Table 1.

Results
In this section we show the results obtained by the validation and ranking algorithm. For this experiment we used the TREC-9 (TREC-9 2000) collection of texts, which contains 3GB of news articles from Wall Street Journal, Financial Times, Financial Report, etc. Using the causation verbs obtained in step 3 of Procedure 1, the system formed queries and searched the TREC collection. In this way, 50 sentences containing each verb were selected. The new corpus thus formed (3,000 sentences) was part-of-speech tagged and parsed. For each head of the noun phrases in the CAUSE and EFFECT positions, the system automatically determined the most general subsumer for each sense. The algorithm presented in section 4 was implemented, and the system output 1,321 causal relationships, ranked by generality.
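The five-step validation and ranking procedure that produced this output can be sketched as a cascade of checks. This is a simplified reading, not the authors' implementation: each head noun is reduced to the list of top-level classes of its senses, the soft constraint and the verb check are passed in as booleans, and returning rank 5 for a reliable verb (and `None` otherwise) is an assumption about how Step 5 is applied. The sense-class lists in the usage example are illustrative, not looked up in WordNet.

```python
CAUSATION_CLASSES = {
    "human action", "phenomenon", "state", "psychological feature", "event",
}

def all_causal(sense_classes):
    """True when every sense of a noun is in a causation class.

    A monosemous noun is simply a list with one element.
    """
    return bool(sense_classes) and all(c in CAUSATION_CLASSES for c in sense_classes)

def rank_relationship(cause_senses, effects,
                      cause_is_causal_agent=False, verb_is_reliable=False):
    """Return the causation rank (1-5) of a CAUSE-verb-EFFECT relationship, or None.

    cause_senses: top-level class of each sense of the CAUSE head noun.
    effects: one such list per EFFECT head noun (several for an enumeration).
    """
    if effects and all(all_causal(e) for e in effects):
        # Steps 1 and 2: the EFFECT is unambiguous; check the CAUSE as well.
        return 1 if all_causal(cause_senses) else 2
    if len(effects) > 1 and any(all_causal(e) for e in effects):
        # Step 3: one unambiguous member disambiguates the whole enumeration.
        return 3
    if cause_is_causal_agent:
        # Step 4: ambiguous EFFECT, but the CAUSE meets the soft constraint.
        return 4
    # Step 5 (assumed reading): fall back on the verb's position in Table 1.
    return 5 if verb_is_reliable else None

examples = {
    "invasion provoked the Second World War":
        rank_relationship(["human action"], [["event"]]),
    "Bleustein-Blanchet sparked a controversy":
        rank_relationship(["entity"], [["human action"]]),
    "Fed will induce a recession and unemployment":
        rank_relationship(["group, grouping"], [["state", "entity"], ["state"]]),
    "The drugs induce the growth":
        rank_relationship(["entity"], [["state", "group, grouping", "entity"]],
                          cause_is_causal_agent=True),
    "The issue gives rise to a big concern":
        rank_relationship(["psychological feature", "entity"],
                          [["psychological feature", "group, grouping"]],
                          verb_is_reliable=True),
}
for sentence, rank in examples.items():
    print(rank, sentence)
```

Each check runs only when the earlier ones have failed, which mirrors the description of the algorithm as a procedure in which a step is followed only if the previous condition was not satisfied.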
The results were validated by comparison with human annotation. We asked two subjects, other than the authors, to rank a list of 300 relationships, of which only 230 referred to causation according to our algorithm. Out of the 300 relationships, the subjects selected only 151 on average as causal relationships (Table 2). The ranking of the causal relationships differed from one subject to the other by about 36%, and also differed from the system's output. The accuracy obtained by our system in comparison with the average of the two human annotations was 65.6%.
                     Rank 1   Rank 2   Rank 3   Rank 4   Total
System                   37       73       28       92     230
Human annotator 1                                          162 (70.43%)
Human annotator 2                                          140 (60.87%)
Table 2: Comparison with human annotation and accuracy obtained for the 230 causal relationships (the percentages in parentheses represent the accuracy of the system relative to each human annotator).
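The figures in Table 2 can be checked with a few lines of arithmetic. The sketch assumes that accuracy means the share of the 230 system-detected relationships that an annotator also accepted as causal, which is the reading under which the percentages in the table work out; the variable names are illustrative.

```python
# Counts taken from Table 2.
SYSTEM_BY_RANK = {1: 37, 2: 73, 3: 28, 4: 92}
ANNOTATOR_ACCEPTED = {"annotator 1": 162, "annotator 2": 140}

# Total number of relationships the system labelled as causation.
system_total = sum(SYSTEM_BY_RANK.values())

# Accuracy of the system relative to each annotator, in percent.
per_annotator = {
    name: 100 * accepted / system_total
    for name, accepted in ANNOTATOR_ACCEPTED.items()
}

# Average number of accepted relationships and the resulting accuracy.
average_accepted = sum(ANNOTATOR_ACCEPTED.values()) / len(ANNOTATOR_ACCEPTED)
average_accuracy = 100 * average_accepted / system_total

print(system_total, per_annotator, average_accepted, average_accuracy)
```

The average comes to roughly 65.65%, which corresponds to the 65.6% reported in the text.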
Discussion and Conclusions
The approach presented in this paper for the detection and validation of causation patterns is a novel one. Even though the method is semi-automatic, it considerably reduces the time and user effort required compared with previous attempts (Garcia 1997), (Khoo et al. 2000). Khoo et al. obtained a better accuracy, but they restricted their text corpus to a medical database and did not handle the ambiguity problem. Our method automatically discovers generally applicable lexico-syntactic patterns referring to causation and disambiguates the causal relationships obtained by applying the patterns to text.

We intend to extend the analysis to other causation patterns and to devise a general algorithm for the detection, and especially for the validation, of causation patterns. We also plan to test the method on other semantic relations, such as PART-OF and INFLUENCE.

References
Marti Hearst. Automated Discovery of WordNet Relations. In WordNet: An Electronic Lexical Database and Some of its Applications, ed. C. Fellbaum, MIT Press, 1998.
D. Garcia. COATIS, an NLP system to locate expressions of actions connected by causality links. In Knowledge Acquisition, Modeling and Management, Proceedings of the Tenth European Workshop, EKAW '97, pages 347-352, 1997.
L. Joskowicz, T. Ksiezyk and R. Grishman. Deep domain models for discourse analysis. In The Annual AI Systems in Government Conference, Silver Spring MD, pages 195-200, 1989.
R.M. Kaplan and G. Berry-Rogghe. Knowledge-based acquisition of causal relationships in text. In Knowledge Acquisition, 3(3), pages 317-337, 1991.
Christopher Khoo, Syin Chan and Yun Niu. Extracting Causal Knowledge from a Medical Database Using Graphical Patterns. In Proceedings of the 38th Annual Meeting of the ACL, Hong Kong, 2000, pages 336-343.
Jaegwon Kim. Causes and Events: Mackie on Causation. In Causation, Oxford Readings in Philosophy, ed. Ernest Sosa and Michael Tooley, Oxford University Press, 1993.
G.A. Miller. WordNet: A Lexical Database. Communications of the ACM, vol. 38, no. 11, pages 39-41, 1995.
V.P. Nedjalkov and G. Silnickij. The typology of causative constructions. In Folia Linguistica (6), 1973, pages 273-290 (German translation).
Text REtrieval Conference. http://trec.nist.gov, 2000.