Some statistical methods for evaluating information extraction systems

Will Lowe
Computer Science Department
Bath University
wlowe@latte.harvard.edu

Gary King
Center for Basic Research in the Social Sciences
Harvard University
king@harvard.edu

Abstract

We present new statistical methods for evaluating information extraction systems. The methods were developed to evaluate a system used by political scientists to extract event information from news leads about international politics. The nature of this data presents two problems for evaluators: 1) the frequency distribution of event types in international event data is strongly skewed, so a random sample of news leads will typically fail to contain any low frequency events. 2) Manual information extraction necessary to create evaluation sets is costly, and most effort is wasted coding high frequency categories.

We present an evaluation scheme that overcomes these problems with considerably less manual effort than traditional methods, and also allows us to interpret an information extraction system as an estimator (in the statistical sense) and to estimate its bias.

1 Introduction

This paper introduces a statistical approach we developed to evaluate information extraction systems used to study international relations. Event extraction is a form of categorization, but the highly skewed frequency profile of international event categories in real data generates severe problems for evaluators. We discuss these problems in section 3, show how to circumvent them using a novel sampling scheme in section 4, and briefly describe our application. Finally we discuss the advantages and disadvantages of the methods, and their relations to standard evaluation procedure. We start with a brief review of information extraction in international relations.

2 Event Analysis in International Relations

Researchers in quantitative international relations have been performing manual information extraction since the mid-1970s (McClelland, 1978; Azar, 1982). The information extracted has remained fairly simple; a researcher fills a ‘who did what to whom’ template, usually from historical documents, a list of countries and international organizations to describe the actors, and a more or less articulated ontology of international events to describe what occurred (McClelland, 1978).

In the early 1990s automated information extraction tools mostly replaced manual coding efforts (Schrodt et al., 1994). Information extraction systems in international relations perform a similar task to those competing in early Message Understanding Competitions (Sundheim, 1991, 1992). With machine extracted events data it is now possible to do near real-time conflict forecasting with data based on newswire leads, and detailed political analysis afterwards.

3 Event Category Distributions

We wanted to evaluate an information extraction system from Virtual Research Associates [1]. This system bundles extraction and visualization software with a custom event ontology containing, at last count, about 200 categories of international event.

[1] http://www.vranet.com

We found two problems with the nature of international events data. First, the frequency distribution over the system's ontology, or indeed several other ontologies we considered, is heavily skewed. A handful of mostly diplomatic event types predominate, and the frequency of other event types falls off very sharply: we ran the system over all the news leads in Reuters' coverage of the Bosnia conflict, and of the approximately 45,000 events it extracted, 10,605 were in the category of ‘neutral comment’, 4 of ‘apology’ and 35 of ‘threat of force’. Thus the relative frequencies of event categories in this data can be 2,500 to 1.

Also, as these figures suggest, the more interesting and politically relevant events tend to be of low frequency. This problem is quite general in categorization systems with reasonably articulated category systems, and not specific to international relations. But any dataset with these properties causes an immediate problem for evaluation.

Ideally we would choose a random subset of leads whose events are known with certainty (because we have coded them manually beforehand), run the system over them, and then compute various sample statistics such as precision and recall [2]. However, a small randomly chosen subset is very unlikely to contain instances of the most interesting events, and so the system's performance will not be evaluated on them. Given the possible frequency ratios above, the size of subset necessary to ensure reasonable coverage of lower frequency event categories is enormous. Put more concretely, to construct a test set of news leads the evaluator will on average have to code around 2,500 comments to reach a single apology and about 300 comments to find a single threat of force.

[2] This paper only evaluates extraction performance on event types, though there would seem to be no reason why a similar approach would not work for actors etc.

3.1 Standard Evaluation Methods

The standard evaluation methods developed over the course of the Message Understanding Competitions consist mainly in sample statistics to compute over the evaluation materials, e.g. precision and recall, but do not give any guidance for choosing the materials themselves (Cowie and Lehnert, 1996; Grishman, 1997). This is just done by hand by the judges. Perhaps because the selection question is neglected, it is seldom clear what larger population the test materials are from (save that it is the same one as the training examples), and as a consequence it is unclear what the implications for generalization are when a system obtains a particular set of scores for precision and recall (Lehnert and Sundheim, 1991).

Since this literature did not help us generate a suitable evaluation sample, we approached the problem from scratch, and developed a statistical framework specific to our needs.

4 Method

One reasonable-sounding but wrong way to address the problem of creating a test set without having to code tens of thousands of irrelevant stories is the following:

1. Use the extraction system itself to perform an initial coding,
2. Take a sample of the output that covers all the event types in reasonable quantities,
3. Examine each coding to see whether the system assigned the correct event code.

This looks like it can guarantee a good sample of low frequency events at much lower cost to the manual coder; we can just pick a fixed number of events from each category and evaluate them. However, this method exhibits selection bias. To see this, let $M$ and $T$ be variables indicating which event category the Machine (that is, the information extraction system) codes an event into, and the True category to which the event actually belongs. Statistically, the quantity of interest to us is the probability that the machine is correct:

$$P(M = i \mid T = i) \tag{1}$$

This is the probability that the machine classifies an event into category $i$ given that the true event coding is indeed $i$. A full characterization of the success of the machine requires knowing $P(M = i \mid T = i)$ for $i = 0, \ldots, J$, which includes all $J$ event categories and where $i = 0$ denotes the situation where the machine is unable to classify an event into any category. In short, the quantity of interest is the full probability density $P(M \mid T)$.

In statistical terms, this distribution is a likelihood function for the information extraction system. This observation allows us to treat the system like any other statistical estimator and offers the interesting possibility of analyzing generalization via its sampling properties, e.g. its bias, variance, mean squared error, or risk.

Unfortunately, the problem with the reasonable-sounding approach described above is that it does not in fact allow us to estimate $P(M \mid T)$ because it is implicitly conditioning on $M$, not $T$. In particular, the proportion of events that are actually in category $i$ among those the machine put in category $i$ gives us instead an estimate of

$$P(T \mid M) \tag{2}$$

which is not the quantity of interest. (2) is the probability of the truth being in some event category rather than the machine's response, whereas in fact the true event category is fixed and it is the machine's response that is uncertain [3]. Worse, $P(T \mid M)$ is a systematically biased estimate of $P(M \mid T)$ because these two quantities are related by Bayes theorem:

$$P(M \mid T) = \frac{P(M, T)}{P(T)} = \frac{P(T \mid M)\,P(M)}{P(T)}, \tag{3}$$

and the only circumstance under which they would be equal is when $P(M)$ is uniform. But the figures in section 3 suggest that $P(M)$ is highly skewed.

[3] This is due to changes in the journalist's choice of vocabulary and syntactic construction that are uncorrelated with the identity of the event being described.

However this last observation suggests a better method for unbiased estimation of (1).

1. Estimate $P(T \mid M)$ as described above
2. Compute $P(M)$ by running the system over the entire data set and normalizing the frequency histogram of event categories
3. Estimate $P(M \mid T)$ by correcting $P(T \mid M)$ with $P(M)$ using Bayes theorem

Our implementation of this scheme was to first run the system over 45,000 leads about the Bosnia conflict, and normalize the frequency histogram of events extracted to create $P(M)$. Then, randomly choose 5 leads assigned to each event category, and manually determine which event type they instantiate. Then normalize to estimate $P(T \mid M)$. And finally, use (3) to create $P(M \mid T)$. In addition, we chose four times as many uncategorized leads as from each true category. A larger sample here is advisable to see what sort of categories the system misses. These sample sizes are fixed, but it may also be possible to use active learning techniques to tune them (as in e.g. Argamon-Engelson and Dagan, 1999) for even more efficient sampling.

The advantage of this roundabout route to (1) is that it requires many fewer events to be manually coded. We ran the system over 45,000 leads but only manually coded a handful of events for each category. This guaranteed us even coverage of the lowest frequency event categories without biasing the end result. For an ontology with about 200 categories this is a substantial decrease in evaluator effort.

This method works by making use of the extraction system itself to produce one important marginal: $P(M)$. If we assume that the aim is to evaluate the system on the Bosnia conflict, $P(M)$ is not estimated, but is rather an exact population marginal [4]. Then we can guarantee that our estimate of $P(M \mid T)$ is unbiased because the method for estimating $P(T \mid M)$ is clearly unbiased, and $P(M)$ adds no error.

[4] We might consider the Bosnian conflict to be a sample point from the larger population of all wars, but that population, if it exists at all, is certainly difficult to quantify.

4.1 Summary Measures

$P(M \mid T)$ allows the computation of a number of useful summary measures [5]. For example, we can easily compute $P(M, T)$ from quantities already available, so $\sum_{i}^{J} P(M = i, T = i)$ is the proportion of time the system extracts the correct category. Alternatively, if it is more important to extract some categories than others, then various weighted measures can be constructed, e.g. $\sum_{i}^{J} P(M = i \mid T = i)\,w_i$, where the $w_i$ are non-negative and sum to 1, representing the relative importance of extracting each category. Some more graphical methods of evaluation using $P(M \mid T)$ are presented below.

[5] Detailed discussion of several summary measures for the system we evaluated can be found in King and Lowe (2002).

4.2 Estimator Properties

Given a likelihood function for the extraction system we can investigate its properties as an estimator. It is particularly useful to know the bias of an estimator, defined in this case as the difference between the expected category response from the system when the true event category is $i$, and $i$ itself, where the expectation is taken over repeated information extraction tasks that instantiate the same event categories. We do not examine the corresponding variance here, and a more complete evaluation might also address the question of consistency.

4.2.1 Conflict and Cooperation

The machine's response and the true category are best seen as sets of multinomial probabilities (a unit vector with the value 1 at the index of the system's extracted category or the true category respectively). Estimator properties are cumbersome to represent in this format, so here we map the system's response to a single real value corresponding to the level of conflict or cooperation of the event category. This re-representation is usual in international relations and allows standard econometric time series methods to be applied (Schrodt and Gerner, 1994; Goldstein and Freeman, 1990; Goldstein and Pevehouse, 1997). For our purposes it also allows the straightforward graphical presentation of the main ideas.

We define the conflict or cooperation level of an event category $i$ as $G_i$, a real number between -10 (most conflictual) and 10 (most cooperative) (see Goldstein, 1992, for the full mapping). For example, according to this scheme, when $i$ denotes the event category ‘extending economic aid’, $G_i = 7.4$, ‘policy endorsement’ maps to 3.6, ‘halt negotiations’ maps to -3.8, and a ‘military engagement’ maps to -10, the maximally conflictual event. The mapping allows univariate, and politically relevant, comparison between the true conflict level and that of the event categories the system extracts.

The expected system response when the true category has conflict/cooperation level $G_i$ is:

$$g_i = \sum_{j}^{J} G_j\,P(M = j \mid T = i, M \neq 0) \tag{4}$$

where

$$P(M = j \mid T = i, M \neq 0) = \frac{P(M \mid T)\,\mathbf{1}(M \neq 0)}{P(M \neq 0 \mid T)}$$

and $\mathbf{1}(M \neq 0)$ is an indicator function equaling 1 if $M \neq 0$ and 0 otherwise.

A plot of $G_i$ against $g_i$ for each event category is shown in Figure 1. An unbiased estimator would show expected values on the main diagonal. Estimator bias for event category $i$ is simply $g_i - G_i$. Estimator variance is simply the spread around the diagonal.

[Figure 1: Expected ($g_i$) versus true ($G_i$) conflict-cooperation level for each event category. Horizontal axis: $G_i$; vertical axis: $g_i$; both run from -10 to 10.]

4.3 Comparison

We also compared the system's performance to 3 undergraduate coders (U1-3) working on the same data set. To examine undergraduate performance requires first $P(U, T)$, from which we can
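
As an illustration of the three-step scheme in section 4 and the bias calculation in section 4.2.1, the following minimal Python sketch applies equation (3) to a toy ontology with two categories plus the uncategorized response. It is a sketch under stated assumptions, not the implementation used for the evaluation: all counts are invented, the function names are hypothetical, $P(T)$ is obtained by summing the joint distribution over machine categories, and the sum in equation (4) is taken over categories 1 to $J$. Only the two $G$ values (3.6 and -10) come from the text.

```python
def normalize(counts):
    """Turn a list of non-negative counts into proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def bayes_correct(p_t_given_m, p_m):
    """Apply equation (3): P(M|T) = P(T|M) P(M) / P(T).

    p_t_given_m[m][t] holds P(T=t | M=m); p_m[m] holds P(M=m).
    Assumption: P(T) is computed by summing the joint over m.
    Returns (p_m_given_t, joint), both indexed [m][t].
    """
    n_m, n_t = len(p_m), len(p_t_given_m[0])
    joint = [[p_t_given_m[m][t] * p_m[m] for t in range(n_t)]
             for m in range(n_m)]
    p_t = [sum(joint[m][t] for m in range(n_m)) for t in range(n_t)]
    p_m_given_t = [
        [joint[m][t] / p_t[t] if p_t[t] > 0 else 0.0 for t in range(n_t)]
        for m in range(n_m)
    ]
    return p_m_given_t, joint


def expected_level(p_m_given_t, levels, i):
    """Equation (4): expected conflict-cooperation level g_i when T = i,
    conditioning on the machine having assigned some category (M != 0).
    Assumption: the sum runs over categories 1..J; levels[0] is unused."""
    coded = range(1, len(levels))
    p_coded = sum(p_m_given_t[j][i] for j in coded)
    return sum(levels[j] * p_m_given_t[j][i] for j in coded) / p_coded


# Toy data (invented). Index 0 = uncategorized, 1 = 'policy endorsement',
# 2 = 'military engagement'.
machine_counts = [200, 700, 100]   # system output over the whole corpus
hand_coded = [                     # rows: machine category; columns: true category
    [14, 4, 2],                    # 20 uncategorized leads (four times as many)
    [0, 5, 0],                     # 5 leads the system coded as category 1
    [1, 1, 3],                     # 5 leads the system coded as category 2
]
levels = [None, 3.6, -10.0]        # G values for categories 1 and 2, from the text

p_m = normalize(machine_counts)                    # step 2: exact P(M)
p_t_given_m = [normalize(row) for row in hand_coded]  # step 1: estimate P(T|M)
p_m_given_t, joint = bayes_correct(p_t_given_m, p_m)  # step 3: P(M|T) via (3)

# Section 4.1: proportion of events given the correct category.
accuracy = sum(joint[i][i] for i in range(len(p_m)))

# Section 4.2.1: expected level g_i and bias g_i - G_i for each true category.
g = {i: expected_level(p_m_given_t, levels, i) for i in (1, 2)}
bias = {i: g[i] - levels[i] for i in (1, 2)}

print("P(T=2 | M=2) =", p_t_given_m[2][2])
print("P(M=2 | T=2) =", p_m_given_t[2][2])
print("accuracy =", accuracy)
print("bias =", bias)
```

With these toy numbers the uncorrected proportion $P(T = 2 \mid M = 2)$ and the corrected $P(M = 2 \mid T = 2)$ differ, which is the selection bias described in section 4: the two only coincide when $P(M)$ is uniform.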