CompuTerm 2004 Poster Session - 3rd International Workshop on Computational Terminology

Term Extraction from Korean Corpora via Japanese

Atsushi Fujii, Tetsuya Ishikawa
Graduate School of Library, Information and Media Studies
University of Tsukuba
1-2 Kasuga, Tsukuba 305-8550, Japan
{fujii,ishikawa}@slis.tsukuba.ac.jp

Jong-Hyeok Lee
Division of Electrical and Computer Engineering,
Pohang University of Science and Technology,
Advanced Information Technology Research Center
San 31 Hyoja-dong Nam-gu, Pohang 790-784, Republic of Korea
jhlee@postech.ac.kr

Abstract

This paper proposes a method to extract foreign words, such as technical terms and proper nouns, from Korean corpora and produce a Japanese-Korean bilingual dictionary. Specific words have been imported into multiple countries simultaneously, if they are influential across cultures. The pronunciation of a source word is similar in different languages. Our method extracts words in Korean corpora that are phonetically similar to Katakana words, which can easily be identified in Japanese corpora. We also show the effectiveness of our method by means of experiments.

1 Introduction

Reflecting the rapid growth in science and technology, new words have progressively been created. However, due to the limitation of manual compilation, new words are often out-of-dictionary words and decrease the quality of human language technology, such as natural language processing, information retrieval, machine translation, and speech recognition. To resolve this problem, a number of automatic methods to extract monolingual and bilingual lexicons from corpora have been proposed for various languages.

In this paper, we focus on extracting foreign words (or loanwords) in Korean. Technical terms and proper nouns are often imported from foreign languages and are spelled out (or transliterated) by the Korean alphabet system called Hangul. A similar trend can be observed in Japanese and Chinese. In Japanese, foreign words are spelled out by its special phonetic alphabet (or phonogram) called Katakana. Thus, foreign words can be extracted from Japanese corpora with a high accuracy, because the Katakana characters are seldom used to describe the conventional Japanese words, except for proper nouns.

However, extracting foreign words from Korean corpora is more difficult, because in Korean both the conventional and foreign words are written with Hangul characters. This problem remains a challenging issue in computational linguistic research.

It is often the case that specific words have been imported into multiple countries simultaneously, because the source words (or concepts) are usually influential across cultures. Thus, it is feasible that a large number of foreign words in Korean can also be foreign words in Japanese.

In addition, the foreign words in Korean and Japanese corresponding to the same source word are phonetically similar. For example, the English word “system” has been imported into both Japanese and Korean. The romanized words are /sisutemu/ and /siseutem/ in Japanese and Korean, respectively.

Motivated by these assumptions, we propose a method to extract foreign words in Korean corpora by means of Japanese. In brief, our method performs as follows. First, foreign words in Japanese are collected, for which Katakana words in corpora and existing lexicons can be used. Second, from Korean corpora the words that are phonetically similar to Katakana words are extracted. Finally, extracted Korean words are compiled in a lexicon with the corresponding Japanese words.

In summary, our method can extract foreign words in Korean and produce a Japanese-Korean bilingual lexicon in a single framework.

2 Methodology

2.1 Overview

Figure 1 exemplifies our extraction method, which produces a Japanese-Korean bilingual lexicon using a Korean corpus and Japanese corpus and/or lexicon. The Japanese and Korean corpora do not have to be parallel or comparable. However, it is desirable that both corpora are associated with the same domain. For the Japanese resource, the corpus and lexicon can alternatively be used or can be used together. Note that compiling a Japanese monolingual lexicon is less expensive than compiling a bilingual lexicon. In addition, new Katakana words can easily be extracted from a number of on-line resources, such as the World Wide Web. Thus, the use of Japanese lexicons does not decrease the utility of our method.

First, we collect Katakana words from Japanese resources. This can systematically be performed by means of a Japanese character code, such as EUC-JP and SJIS.

Second, we represent the Korean corpus and Japanese Katakana words by the Roman alphabet (i.e., romanization), so that the phonetic similarity can easily be computed. However, we use different romanization methods for Japanese and Korean.
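
The first step above, collecting Katakana words, can be sketched as follows. This is only an illustration under stated assumptions: the text is assumed to be already decoded to Unicode (the paper works with character codes such as EUC-JP and SJIS), and the function name `collect_katakana_words` and the `min_length` filter are hypothetical choices, not part of the original method.

```python
import re

# Katakana letters (U+30A1 to U+30FA) plus the prolonged sound mark (U+30FC).
# Assumption: the input is Unicode text; the paper itself refers to
# character codes such as EUC-JP and SJIS.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def collect_katakana_words(text, min_length=2):
    """Return the set of maximal Katakana runs with at least `min_length` characters."""
    return {run for run in KATAKANA_RUN.findall(text) if len(run) >= min_length}


if __name__ == "__main__":
    sample = "新しいシステムとコンピュータを使う"
    print(sorted(collect_katakana_words(sample)))
```

Because Katakana occupies a contiguous code range, a single character class is enough to separate candidate foreign words from Hiragana and Kanji text.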
Third, we extract candidates of foreign words from the romanized Korean corpus. An alternative method is to first perform morphological analysis on the corpus, extract candidate words based on morphemes and parts-of-speech, and romanize the extracted words. Our general model does not constrain which method should be used in the third step. However, because the accuracy of analysis often decreases for new words to be extracted, we experimentally adopt the former method.

Finally, we compute the phonetic similarity between each combination of the romanized Hangul and Katakana words, and select the combinations whose score is above a predefined threshold. As a result, we can obtain a Japanese-Korean bilingual lexicon consisting of foreign words.

It may be argued that English lexicons or corpora can be used as source information, instead of Japanese resources. However, because not all English words have been imported into Korean, the extraction accuracy will decrease due to extraneous words.

Figure 1: Overview of our extraction method.

2.2 Romanizing Japanese

Because the number of phones consisting of Japanese Katakana characters is limited, we manually produced the correspondence between each phone and its Roman representation. The numbers of Katakana characters and combined phones are 73 and 109, respectively. We also defined a symbol to represent a long vowel. In Japanese, the Hepburn and Kunrei systems are commonly used for romanization purposes. We use the Hepburn system, because its representation is similar to that in Korean, compared with the Kunrei system.

However, specific Japanese phones, such as /ti/, do not exist in Korean. Thus, to adapt the Hepburn system to Korean, /ti/ and /tu/ are converted to /chi/ and /chu/, respectively.

2.3 Romanizing Korean

The number of Korean Hangul characters is much greater than that of Japanese Katakana characters. Each Hangul character is a combination of more than one consonant. The pronunciation of each character is determined by its component consonants.

In Korean, there are three types of consonant, i.e., the first consonant, vowel, and last consonant. The numbers of these consonants are 19, 21, and 27, respectively. The last consonant is optional. Thus, the number of combined characters is 11,172 (19 × 21 × 28, counting the absence of a last consonant as a 28th case). However, to transliterate imported words, the official guideline suggests that only seven consonants be used as the last consonant. In EUC-KR, which is a standard coding system for Korean text, 2,350 common characters are coded independent of the pronunciation. Therefore, if we target corpora represented by EUC-KR, each of the 2,350 characters has to be mapped to its Roman representation.

We use Unicode, in which Hangul characters are sorted according to the pronunciation. Figure 2 depicts a fragment of the Unicode table for Korean, in which each line corresponds to a combination of the first consonant and vowel and each column corresponds to the last consonant. The number of columns is 28, i.e., the number of the last consonants and the case in which the last consonant is not used. From this figure, the following rules can be found:

  • the first consonant changes every 21 lines, which corresponds to the number of vowels,
  • the vowel changes every line (i.e., 28 characters) and repeats every 21 lines,
  • the last consonant changes every column.

Based on these rules, each character and its pronunciation can be identified by the three consonant types. Thus, we manually mapped only the 68 consonants to the Roman alphabet.

Figure 2: A fragment of the Unicode table for Korean Hangul characters.

We use the official romanization system for Korean, but specific Korean phones are adapted to Japanese. For example, /j/ and /l/ are converted to /z/ and /r/, respectively.

It should be noted that the adaptation is not invertible and thus is needed for both J-to-K and K-to-J directions.
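
The Unicode layout described in Section 2.3 can be sketched as simple integer arithmetic on code points. The sketch below is an assumption-laden illustration rather than the authors' table: the three lists follow a letter-by-letter mapping in the style of the Revised Romanization of Korean, the names `decompose` and `romanize` are hypothetical, and the Korean-to-Japanese adaptation (e.g., /j/ to /z/) is deliberately left out.

```python
HANGUL_BASE = 0xAC00   # code point of the first precomposed Hangul syllable
NUM_VOWELS = 21
NUM_FINALS = 28        # 27 last consonants plus "no last consonant"

# Assumed romanization tables (19 + 21 + 28 = 68 entries in total).
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
FINALS = ["", "g", "kk", "gs", "n", "nj", "nh", "d", "l", "lg", "lm", "lb",
          "ls", "lt", "lp", "lh", "m", "b", "bs", "s", "ss", "ng", "j", "ch",
          "k", "t", "p", "h"]


def decompose(syllable):
    """Return (first consonant, vowel, last consonant) indices of one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < len(INITIALS) * NUM_VOWELS * NUM_FINALS:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    # The first consonant changes every 21 * 28 characters,
    # the vowel every 28 characters, the last consonant every character.
    initial, rest = divmod(offset, NUM_VOWELS * NUM_FINALS)
    vowel, final = divmod(rest, NUM_FINALS)
    return initial, vowel, final


def romanize(word):
    """Romanize a string of Hangul syllables with the assumed tables."""
    parts = []
    for syllable in word:
        initial, vowel, final = decompose(syllable)
        parts.append(INITIALS[initial] + VOWELS[vowel] + FINALS[final])
    return "".join(parts)


if __name__ == "__main__":
    # U+C2DC U+C2A4 U+D15C is the Korean spelling of "system".
    print(romanize("\uc2dc\uc2a4\ud15c"))
```

With this decomposition, only the 68 table entries need to be written by hand, instead of one entry per precomposed character.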
For example, the English word “cheese”, which has been imported to both Korean and Japanese as a foreign word, is romanized as /chiseu/ in Korean and /ti:zu/ in Japanese. Here, /:/ is the symbol representing a Japanese long vowel. Using the adaptation, these expressions are converted to /chizu/ and /chi:zu/, respectively, which look more similar to each other, compared with the original strings.

2.4 Extracting term candidates from Korean corpora

To extract candidates of foreign words from a Korean corpus, we first extract phrases. This can be performed systematically, because Korean sentences are segmented on a phrase-by-phrase basis.

Second, because foreign words are usually nouns, we use hand-crafted rules to remove post-position suffixes (e.g., Josa) and extract nouns from phrases.

Third, we discard nouns including the last consonants that are not recommended for transliteration purposes in the official guideline. Although the guideline suggests other rules for transliteration, existing foreign words in Korean are not necessarily regulated by these rules.

Finally, we consult a dictionary to discard existing Korean words, because our purpose is to extract new words. For this purpose, we experimentally use the dictionary for the SuperMorph-K morphological analyzer¹, which includes approximately 50,000 Korean words.

¹ http://www.omronsoft.com/

2.5 Computing Similarity

Given romanized Japanese and Korean words, we compute the similarity between the two strings and select the pairs associated with the score above a threshold as translations. We use a DP (dynamic programming) matching method to identify the number of differences (i.e., insertion, deletion, and substitution) between two strings, on a letter-by-letter basis.

In principle, if two strings are associated with a smaller number of differences, the similarity between them becomes greater. For this purpose, a Dice-style coefficient can be used.

However, while the use of consonants in transliteration is usually the same across languages, the use of vowels can vary significantly depending on the language. For example, the English word “system” is romanized as /sisutemu/ and /siseutem/ in Japanese and Korean, respectively. Thus, the differences in consonants between two strings should be penalized more than the differences in vowels.

In view of the above discussion, we compute the similarity between two romanized words by Equation (1).

$$1 - \frac{2 \cdot (\alpha \cdot d_c + d_v)}{\alpha \cdot c + v} \qquad (1)$$

Here, $d_c$ and $d_v$ denote the numbers of differences in consonants and vowels, respectively, and $\alpha$ is a parametric constant used to control the importance of the consonants. We experimentally set $\alpha = 2$. In addition, $c$ and $v$ denote the numbers of all consonants and vowels in the two strings. The similarity ranges from 0 to 1.

3 Experimentation

3.1 Evaluating Extraction Accuracy

We collected 111,166 Katakana words (word types) from multiple Japanese lexicons, most of which were technical term dictionaries.

We used the Korean document set in the NTCIR-3 Cross-lingual Information Retrieval test collection². This document set consists of 66,146 newspaper articles of Korean Economic Daily published in 1994. We randomly selected 50 newspaper articles and used them for our experiment. We asked a graduate student, who was not one of the authors of this paper, to identify foreign words in the target text. As a result, 124 foreign word types (205 word tokens) were identified, which were fewer than we had expected. This was partially due to the fact that newspaper articles generally do not contain a large number of foreign words, compared with technical publications.

² http://research.nii.ac.jp/ntcir/index-en.html

We manually classified the extracted words and used only the words that were imported to both Japan and Korea from other languages. We discarded foreign words in Korea imported from Japan, because these words were often spelled out by non-Katakana characters, such as Kanji (Chinese character). A sample of these words includes “Tokyo (the capital of Japan)”, “Heisei (the current Japanese era name)”, and “enko (personal connection)”. In addition, we discarded the foreign proper nouns for which the human subject was not able to identify the source word. As a result, we obtained 67 target word types. Examples of original English words for these words are as follows:

    digital, group, dollar, re-engineering, line, polyester, Asia, service, class, card, computer, brand, liter, hotel.

Thus, our method can potentially be applied to roughly a half of the foreign words in Korean text (67 of the 124 identified word types).

We used the Japanese words to extract plausible foreign words from the target Korean corpus. We first romanized the corpus and extracted nouns by removing post-position suffixes. As a result, we obtained 3,106 words including all the 67 target words. After discarding the words in the dictionary for SuperMorph-K, 958 words remained, including 59 target words.

For each of the remaining 958 words, we computed its similarity to each of the 111,166 Japanese words. For evaluation purposes, we varied a threshold for the similarity and investigated the relation between precision and recall. Recall is the ratio of the number of target foreign words extracted by our method to the total number of target foreign
words. Precision is the ratio of the number of target foreign words extracted by our method to the total number of words obtained by our method.

Table 1 shows the precision and recall for different methods. While we varied a threshold of a similarity, we also varied the number of Korean words corresponding to a single Katakana word (N). By decreasing the value of the threshold and increasing the number of words extracted, the recall can be improved but the precision decreases. In Table 1, the precision and recall are in an extreme trade-off relation. For example, when the recall was 69.5%, the precision was only 1.2%.

We manually analyzed the words that were not extracted by our method. Out of the 59 target words, 12 compound words consisting of both conventional and foreign words were not extracted. However, our method extracted compound words consisting of only foreign words. In addition, the three words that did not have counterparts in the input Japanese words were not extracted.

Table 1: Precision/Recall (%) for term extraction.

              Threshold for similarity
              >0.9        >0.7         >0.5
    N=1       50.0/8.5    12.7/40.7    4.1/47.5
    N=10      50.0/8.5    7.4/47.5     1.2/69.5

3.2 Application-Oriented Evaluation

During the first experiment, we determined a specific threshold value for the similarity between Katakana and Hangul words and selected the pairs whose similarity was above the threshold. As a result, we obtained 667 Korean words, which were used to enhance the dictionary for the SuperMorph-K morphological analyzer.

We performed morphological analysis on the 50 articles used in the first experiment, which included 1,213 sentences and 9,557 word tokens. We also investigated the degree to which the analytical accuracy is improved by means of the additional dictionary. Here, accuracy is the ratio of the number of correct word segmentations to the total number of segmentations generated by SuperMorph-K. The same human subject as in the first experiment identified the correct word segmentations for the input articles.

First, we focused on the accuracy of segmenting foreign words. The accuracy was improved from 75.8% to 79.8% by means of the additional dictionary. The accuracy for all words changed from 94.6% to 94.8% with the additional dictionary.

In summary, the additional dictionary was effective for analyzing foreign words and had no adverse side effect on the overall accuracy. At the same time, we concede that we need larger-scale experiments to draw firmer conclusions.

4 Related Work

A number of corpus-based methods to extract bilingual lexicons have been proposed (Smadja et al., 1996). In general, these methods use statistics obtained from a parallel or comparable bilingual corpus and extract word or phrase pairs that are strongly associated with each other. However, our method uses a monolingual Korean corpus and a Japanese lexicon independent of the corpus, which can easily be obtained, compared with parallel or comparable bilingual corpora.

Jeong et al. (1999) and Oh and Choi (2001) independently explored a statistical approach to detect foreign words in Korean text. Although the detection accuracy is reasonably high, these methods require a training corpus in which conventional and foreign words are annotated. Our approach does not require annotated corpora, but the detection accuracy is not high enough as shown in Section 3.1. A combination of both approaches is expected to compensate for the drawbacks of each approach.

5 Conclusion

We proposed a method to extract foreign words, such as technical terms and proper nouns, from Korean corpora and produce a Japanese-Korean bilingual dictionary. Specific words, which have been imported into multiple countries, are usually spelled out by special phonetic alphabets, such as Katakana in Japanese and Hangul in Korean.

Because extracting foreign words spelled out by Katakana in Japanese lexicons and corpora can be performed with a high accuracy, our method extracts words in Korean corpora that are phonetically similar to Japanese Katakana words. Our method does not require parallel or comparable bilingual corpora and human annotation for these corpora.

We also performed experiments in which we extracted foreign words from Korean newspaper articles and used the resultant dictionary for morphological analysis. We found that our method did not correctly extract compound Korean words consisting of both conventional and foreign words. Future work includes larger-scale experiments to further investigate the effectiveness of our method.

References

Kil Soon Jeong, Sung Hyon Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing & Management, 35:523–540.

Jong-Hoon Oh and Key-Sun Choi. 2001. Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of ICCPOL-2001, pages 433–438.

Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1–38.
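
As a supplementary illustration of the similarity in Equation (1) of Section 2.5, the following sketch may help. It rests on several assumptions that are not in the paper: the long-vowel symbol `:` is counted as a vowel, every other non-vowel letter (including `y` and `w`) is counted as a consonant, a difference is counted as a consonant difference when either letter involved is a consonant, the DP minimizes the weighted sum directly rather than the plain number of differences, and negative scores are clamped to 0. The function names are hypothetical.

```python
VOWEL_SYMBOLS = set("aeiou:")  # assumption: ':' (long vowel) counts as a vowel


def is_consonant(ch):
    return ch not in VOWEL_SYMBOLS


def weighted_differences(a, b, alpha):
    """Minimum of alpha * d_c + d_v over all alignments, via DP matching."""
    def cost(*chars):
        # A difference involving any consonant is weighted by alpha (assumption).
        return alpha if any(is_consonant(ch) for ch in chars) else 1.0

    # previous[j] holds the cost of aligning the processed prefix of `a` with b[:j].
    previous = [0.0]
    for cb in b:
        previous.append(previous[-1] + cost(cb))            # insertions only
    for ca in a:
        current = [previous[0] + cost(ca)]                   # deletions only
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0.0 if ca == cb else cost(ca, cb))
            deletion = previous[j] + cost(ca)
            insertion = current[j - 1] + cost(cb)
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def similarity(a, b, alpha=2.0):
    """Dice-style similarity of two romanized words, following Equation (1)."""
    letters = a + b
    c = sum(is_consonant(ch) for ch in letters)  # consonants in both strings
    v = len(letters) - c                         # vowels in both strings
    if c == 0 and v == 0:
        return 1.0
    score = 1.0 - 2.0 * weighted_differences(a, b, alpha) / (alpha * c + v)
    return max(0.0, score)  # clamping is an assumption of this sketch


if __name__ == "__main__":
    print(similarity("sisutemu", "siseutem"))
    print(similarity("chizu", "chi:zu"))
```

With α = 2, the two romanizations of “system” differ only in vowels, so they keep a high score, whereas a single consonant difference would cost twice as much as a vowel difference.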