Some statistical methods for evaluating information extraction systems

Will Lowe
Computer Science Department
Bath University
wlowe@latte.harvard.edu

Gary King
Center for Basic Research in the Social Sciences
Harvard University
king@harvard.edu

Abstract

We present new statistical methods for evaluating information extraction
systems. The methods were developed to evaluate a system used by political
scientists to extract event information from news leads about international
politics. The nature of this data presents two problems for evaluators:
1) the frequency distribution of event types in international event data is
strongly skewed, so a random sample of news leads will typically fail to
contain any low frequency events. 2) Manual information extraction necessary
to create evaluation sets is costly, and most effort is wasted coding high
frequency categories.

We present an evaluation scheme that overcomes these problems with
considerably less manual effort than traditional methods, and also allows us
to interpret an information extraction system as an estimator (in the
statistical sense) and to estimate its bias.

1 Introduction

This paper introduces a statistical approach we developed to evaluate
information extraction systems used to study international relations. Event
extraction is a form of categorization, but the highly skewed frequency
profile of international event categories in real data generates severe
problems for evaluators. We discuss these problems in section 3, show how to
circumvent them using a novel sampling scheme in section 4, and briefly
describe our application. Finally we discuss the advantages and
disadvantages of the methods, and their relations to standard evaluation
procedure. We start with a brief review of information extraction in
international relations.

2 Event Analysis in International Relations

Researchers in quantitative international relations have been performing
manual information extraction since the mid-1970s (McClelland, 1978; Azar,
1982). The information extracted has remained fairly simple; a researcher
fills a 'who did what to whom' template, usually from historical documents,
using a list of countries and international organizations to describe the
actors, and a more or less articulated ontology of international events to
describe what occurred (McClelland, 1978). In the early 1990s automated
information extraction tools mostly replaced manual coding efforts (Schrodt
et al., 1994). Information extraction systems in international relations
perform a similar task to those competing in early Message Understanding
Competitions (Sundheim, 1991, 1992). With machine extracted events data it
is now possible to do near real-time conflict forecasting with data based on
newswire leads, and detailed political analysis afterwards.

3 Event Category Distributions

We wanted to evaluate an information extraction system from Virtual Research
Associates [1]. This system bundles extraction and visualization software
with a custom event ontology containing, at last count, about 200 categories
of international event.

We found two problems with the nature of international events data. First,
the frequency distribution over the system's ontology, or indeed several
other ontologies we considered, is heavily skewed. A handful of mostly
diplomatic event types predominate, and the frequency of other event types
falls off very sharply: we ran the system over all the news leads in Reuters'
coverage of the Bosnia conflict, and of the approximately 45,000 events it
extracted, 10,605 were in the category of 'neutral comment', 4 of 'apology'
and 35 of 'threat of force'. Thus the relative frequencies of event
categories in this data can be 2,500 to 1.

Also, as these figures suggest, the more interesting and politically
relevant events tend to be of low frequency. This problem is quite general
in categorization systems with reasonably articulated category systems, and
not specific to international relations. But any dataset with these
properties causes an immediate problem for evaluation.

Ideally we would choose a random subset of leads whose events are known with
certainty (because we have coded them manually beforehand), run the system
over them, and then compute various sample statistics such as precision and
recall [2]. However, a small randomly chosen subset is very unlikely to
contain instances of most interesting events, and so the system's
performance will not be evaluated on them. Given the possible frequency
ratios above, the size of subset necessary to ensure reasonable coverage of
lower frequency event categories is enormous. Put more concretely, to
construct a test set of news leads the evaluator will on average have to
code around 2,500 comments to reach a single apology and about 300 comments
to find a single threat of force.

[1] http://www.vranet.com
[2] This paper only evaluates extraction performance on event types, though
    there would seem to be no reason why a similar approach would not work
    for actors etc.

3.1 Standard Evaluation Methods

The standard evaluation methods developed over the course of the Message
Understanding Competitions consist mainly in sample statistics to compute
over the evaluation materials, e.g. precision and recall, but do not give
any guidance for choosing the materials themselves (Cowie and Lehnert, 1996;
Grishman, 1997). This is just done by hand by the judges. Perhaps because
the selection question is neglected, it is seldom clear what larger
population the test materials are from (save that it is the same one as the
training examples), and as a consequence it is unclear what the implications
for generalization are when a system obtains a particular set of scores for
precision and recall (Lehnert and Sundheim, 1991).

Since this literature did not help us generate a suitable evaluation sample,
we approached the problem from scratch, and developed a statistical
framework specific to our needs.

4 Method

One reasonable-sounding but wrong way to address the problem of creating a
test set without having to code tens of thousands of irrelevant stories is
the following:

  1. Use the extraction system itself to perform an initial coding,
  2. Take a sample of the output that covers all the event types in
     reasonable quantities,
  3. Examine each coding to see whether the system assigned the correct
     event code.

This looks like it can guarantee a good sample of low frequency events at
much lower cost to the manual coder; we can just pick a fixed number of
events from each category and evaluate them. However, this method exhibits
selection bias. To see this, let M and T be variables indicating which event
category the Machine (that is, the information extraction system) codes an
event into, and the True category to which the event actually belongs.
Statistically, the quantity of interest to us is the probability that the
machine is correct:

    P(M = i \mid T = i)                                                  (1)

This is the probability that the machine classifies an event into category i
given that the true event coding is indeed i. A full characterization of the
success of the machine requires knowing P(M = i | T = i) for i = 0,...,J,
which includes all J event categories and where i = 0 denotes the situation
where the machine is unable to classify an event into any category. In
short, the quantity of interest is the full probability density P(M | T).

In statistical terms, this distribution is a likelihood function for the
information extraction system. This observation allows us to treat the
system like any other statistical estimator and offers the interesting
possibility of analyzing generalization via its sampling properties, e.g.
its bias, variance, mean squared error, or risk.

Unfortunately, the problem with the reasonable-sounding approach described
above is that it does not in fact allow us to estimate P(M | T) because it
is implicitly conditioning on M, not T. In particular, the proportion of
events that are actually in category i among those the machine put in
category i gives us instead an estimate of

    P(T \mid M)                                                          (2)

which is not the quantity of interest. (2) is the probability of the truth
being in some event category rather than the machine's response, whereas in
fact the true event category is fixed and it is the machine's response that
is uncertain [3]. Worse, P(T | M) is a systematically biased estimate of
P(M | T) because these two quantities are related by Bayes' theorem:

    P(M \mid T) = \frac{P(M, T)}{P(T)} = \frac{P(T \mid M)\, P(M)}{P(T)},  (3)

and the only circumstance under which they would be equal is when P(M) is
uniform. But the figures in section 3 suggest that P(M) is highly skewed.

However this last observation suggests a better method for unbiased
estimation of (1).

  1. Estimate P(T | M) as described above
  2. Compute P(M) by running the system over the entire data set and
     normalizing the frequency histogram of event categories
  3. Estimate P(M | T) by correcting P(T | M) with P(M) using Bayes' theorem

Our implementation of this scheme was to first run the system over 45,000
leads about the Bosnia conflict, and normalize the frequency histogram of
events extracted to create P(M). Then, randomly choose 5 leads assigned to
each event category, and manually determine which event type they
instantiate. Then normalize to estimate P(T | M). And finally, use (3) to
create P(M | T). In addition, we chose four times as many uncategorized
leads as leads from each true category. A larger sample here is advisable to
see what sort of categories the system misses. These sample sizes are fixed,
but it may also be possible to use active learning techniques to tune them
(as in e.g. Argamon-Engelson and Dagan, 1999) for even more efficient
sampling.

The advantage of this roundabout route to (1) is that it requires many fewer
events to be manually coded. We ran the system over 45,000 leads but only
manually coded a handful of events for each category. This guaranteed us
even coverage of the lowest frequency event categories whilst not biasing
the end result. For an ontology with about 200 categories this is a
substantial decrease in evaluator effort.

This method works by making use of the extraction system itself to produce
one important marginal: P(M). If we assume that the aim is to evaluate the
system on the Bosnia conflict, P(M) is not estimated, but is rather an exact
population marginal [4]. Then we can guarantee that our estimate of P(M | T)
is unbiased because the method for estimating P(T | M) is clearly unbiased,
and P(M) adds no error.

[3] This is due to changes in the journalist's choice of vocabulary and
    syntactic construction that are uncorrelated with the identity of the
    event being described.
[4] We might consider the Bosnian conflict to be a sample point from the
    larger population of all wars, but that population (if it exists at all)
    is certainly difficult to quantify.
[5] Detailed discussion of several summary measures for the system we
    evaluated can be found in King and Lowe (2002).

4.1 Summary Measures

P(M | T) allows the computation of a number of useful summary measures [5].
For example, we
can easily compute P(M, T) from quantities already available, so
\sum_{i=1}^{J} P(M = i, T = i) is the proportion of time the system extracts
the correct category. Alternatively, if it is more important to extract some
categories than others, then various weighted measures can be constructed,
e.g. \sum_{i=1}^{J} P(M = i \mid T = i)\, w_i, where the w_i are non-negative
and sum to 1, representing the relative importance of extracting each
category. Some more graphical methods of evaluation using P(M | T) are
presented below.

4.2 Estimator Properties

Given a likelihood function for the extraction system we can investigate its
properties as an estimator. It is particularly useful to know the bias of an
estimator, defined in this case as the difference between the expected
category response from the system when the true event category is i, and i
itself, where the expectation is taken over repeated information extraction
tasks that instantiate the same event categories. We do not examine the
corresponding variance here, and a more complete evaluation might also
address the question of consistency.

4.2.1 Conflict and Cooperation

The machine's response and the true category are best seen as a set of
multinomial probabilities (with a unit vector with the value 1 at the index
of the system's extracted category or the true category respectively).
Estimator properties are cumbersome to represent in this format, so here we
map the system's response to a single real value corresponding to the level
of conflict or cooperation of the event category. This re-representation is
usual in international relations and allows standard econometric time series
methods to be applied (Schrodt and Gerner, 1994; Goldstein and Freeman,
1990; Goldstein and Pevehouse, 1997).

For our purposes it also allows the straightforward graphical presentation
of the main ideas. We define the level of conflict or cooperation of an
event category i as G_i, a real number between -10 (most conflictual) and 10
(most cooperative) (see Goldstein, 1992, for the full mapping). For example,
according to this scheme, when i denotes the event category 'extending
economic aid', G_i = 7.4, 'policy endorsement' maps to 3.6, 'halt
negotiations' maps to -3.8, and a 'military engagement' maps to -10, the
maximally conflictual event. The mapping allows univariate, and politically
relevant, comparison between the true conflict level and that of the event
categories the system extracts.

The expected system response when the true category has conflict/cooperation
level G_i is:

    g_i = \sum_{j=1}^{J} G_j\, P(M = j \mid T = i, M \neq 0)             (4)

where

    P(M = j \mid T = i, M \neq 0)
        = \frac{P(M \mid T)\, 1(M \neq 0)}{P(M \neq 0 \mid T)}

and 1(M \neq 0) is an indicator function equaling 1 if M \neq 0 and 0
otherwise.

A plot of G_i against g_i for each event category is shown in Figure 1. An
unbiased estimator would show expected values on the main diagonal.
Estimator bias for event category i is simply g_i - G_i. Estimator variance
is simply the spread around the diagonal.

[Figure 1: Expected (g_i) versus true (G_i) conflict-cooperation level for
each event category. Horizontal axis: G_i; vertical axis: g_i; both axes run
from -10 to 10.]

4.3 Comparison

We also compared the system's performance to 3 undergraduate coders (U1-3)
working on the same data set. To examine undergraduate performance requires
first P(U, T), from which we can
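
As an illustration of the estimation scheme in section 4 (steps 1 to 3 and equation (3)), the following Python sketch may help. It is not the code used for the evaluation reported here. The function names (normalize, estimate_p_m_given_t, overall_accuracy) are hypothetical, the two-category counts are invented toy data rather than the Bosnia figures, and for simplicity it assumes there is no "unclassified" category 0 and that every true category turns up at least once among the checked leads.

```python
from typing import List

Matrix = List[List[float]]


def normalize(counts: List[float]) -> List[float]:
    """Scale non-negative counts so that they sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("cannot normalize an empty histogram")
    return [c / total for c in counts]


def estimate_p_m_given_t(checked: Matrix, machine_counts: List[float]) -> Matrix:
    """Return p[t][m], an estimate of P(M = m | T = t).

    checked[m][t]     -- manually checked leads that the machine coded as m
                         and whose true category turned out to be t
    machine_counts[m] -- events the machine coded as m over the whole data set
    """
    p_t_given_m = [normalize(row) for row in checked]  # step 1: P(T | M)
    p_m = normalize(machine_counts)                    # step 2: P(M)
    n = len(p_m)
    # Joint distribution P(M = m, T = t) = P(T = t | M = m) P(M = m).
    joint = [[p_t_given_m[m][t] * p_m[m] for t in range(n)] for m in range(n)]
    # Marginal P(T = t), obtained by summing the joint over m.
    p_t = [sum(joint[m][t] for m in range(n)) for t in range(n)]
    # Step 3, equation (3): P(M | T) = P(T | M) P(M) / P(T).
    # Assumes p_t[t] > 0 for every t (see the lead-in).
    return [[joint[m][t] / p_t[t] for m in range(n)] for t in range(n)]


def overall_accuracy(checked: Matrix, machine_counts: List[float]) -> float:
    """Sum over i of P(M = i, T = i): the proportion of correct codings."""
    p_m = normalize(machine_counts)
    return sum(normalize(row)[i] * p_m[i] for i, row in enumerate(checked))


# Toy data, invented for illustration: category 0 is common, category 1 rare.
machine_counts = [900, 100]
checked = [
    [4, 1],  # 5 leads the machine coded as 0: 4 truly 0, 1 truly 1
    [1, 4],  # 5 leads the machine coded as 1: 1 truly 0, 4 truly 1
]

p_m_given_t = estimate_p_m_given_t(checked, machine_counts)
naive = checked[1][1] / sum(checked[1])  # P(T = 1 | M = 1)
corrected = p_m_given_t[1][1]            # P(M = 1 | T = 1)
print(naive, corrected, overall_accuracy(checked, machine_counts))
```

With these toy numbers the naive figure P(T = 1 | M = 1) is 0.8, while the corrected P(M = 1 | T = 1) is about 0.31, which shows how far a skewed P(M) can pull the two quantities apart.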