[Mechanical Translation, vol.4, no.3, December 1957; pp. 76-78]

A Refinement in Coding the Russian Cyrillic Alphabet

B. Zacharov, London University, London, England

By reducing the number of characters to be coded, the problem of devising a numerical code for the Cyrillic alphabet can be simplified. This reduction can be achieved by providing code-words for only the lower-case forms of characters that do not occur initially; by disregarding the diacritic of the character ё; and by disregarding the character ъ entirely. Ambiguities that arise in the latter cases can be resolved by an examination of the context.

THE PROBLEM of coding the Russian Cyrillic alphabet in numerical form has been considered previously in several papers [1], and it is clear that it would be desirable if each character of the Russian alphabet (together with any required numbers, punctuation marks and capitals) could be coded in such a way that a separate unique numerical code-word existed for each lower-case character, capital, etc. Unfortunately, the speed of modern digital computers and the size of their memories are such that a code of this form would result in considerable time being spent in the memory search for the appropriate target language equivalent.

It is clear, then, that ways must be found, apart from engineering advances, to speed up the memory search time. One way of doing this would be to decrease the amount of linguistic data stored in the memory, and this has been considered [2]. Another method would be to decrease the amount of numerical data (i.e., the number of bits) in the memory for a given number of source language characters. This last approach has been considered in a recent paper on mechanical translation [3], where all the lower-case characters except ё, й, ъ and ь are represented by a five binary-digit code, while all the capitals and decimal numbers use a ten-bit code. In the code proposed in that paper, simplification is obtained on the basis of the statement that "... five of the 33 Russian letters never start a word and will not need to be capitalized ...". The five Russian letters referred to are ё, й, ъ, ь, ы.

All the other Russian characters occur frequently in both upper and lower case and must be coded either separately in both these forms or by the same numerical code, with the upper case always preceded by some number which denotes an 'upper-case shift'.

Inspection of the statement quoted above reveals that it is formally incorrect with respect to ё, although it is quite correct to state that none of the four characters й, ъ, ь and ы ever begins a word in the Russian language, so that it will clearly never be necessary for them to be coded in upper-case form. (A rigorously phonetic transliteration of some other alphabet into Russian may create a trivial exception in the cases of й and ы. This will not be considered here.)

The Problem of ё

Reference to a Russian-English dictionary [4] shows us that many words of the Russian language begin with ё. Notable examples are ёлка 'fir tree' and ёмкость 'capacity'; the latter is of especial importance in scientific texts.

Superficially, therefore, it would appear that ё should be treated in the same way as the other word-initial characters and that it should be coded in upper and lower case. However, the following points must be considered:

i) In practice, ё is never written in script form with the diacritic, either in lower or upper case; е and Е are used.
ii) A modern standard Russian typewriter keyboard does not contain Ё or ё; the upper and lower case forms of е are used, as in (i).
iii) Both ё and Ё frequently appear in print, especially in the texts of scientific periodicals.

Thus, from (i), (ii) and (iii) above, it can be seen that the problem of encoding ё and Ё is complicated by the source of the Russian language text. If е and ё are coded separately, it would appear that words containing ё would have to be stored in the memory in two separate locations, with both е and ё in the corresponding positions of each word.

a) ё at the beginning of a word

For words with ё at the beginning, any coding difficulty can be overcome if it is noted that, if the diacritic is ignored, no ambiguity can arise. This is because no two words in the Russian language exist with different meaning such that corresponding letters of both words are the same except that ё at the beginning of the first word is replaced by е in the second word. As a result of this consideration it will clearly never be necessary to encode ё in capitalized form; the upper-case form of е will be sufficient.

b) ё in any letter position

If ё occurs in some letter position other than at the beginning of some word (x), ambiguity can arise only if another word (y) exists such that all the letters of the (y)-word are the same as the corresponding letters of the (x)-word except that ё in (x) is replaced by е in (y). Examination of a Russian-English dictionary reveals that this does not occur often in the stem of a word. Similarly, experience tells us that ambiguity seldom arises as a result of word endings together with stem.

Examples of words where ambiguity may occur are:

    все     all (plural)
    всё     all (singular, neuter)

    села    of the village (genitive, singular)
    села    she sat
    сёла    villages (nominative/accusative, plural)

Whereas discrepancy need not necessarily occur in the first example, considerable ambiguity can arise in the second case since the words are different grammatical forms of widely different words (сёла is a plural noun while села may be a verb form or a singular noun).

However, we note that if the contexts of these words are examined, most cases of ambiguity disappear (this is especially true for Russian, where strict grammatical rules concerning case endings and conjugation must be observed). Indeed, such an examination is essential for certain words in Russian and, more especially, in English [5].

Certain Russian words are such that their spelling is associated with multiple meaning and, here, it is often the case that an examination of the context will not reveal which alternative is meant. In this event it becomes necessary to print out all the alternatives stored in the computer memory which correspond to the source word. At this stage a simplification may be effected if the computer dictionary is concerned only with a certain field (e.g., nuclear physics), in which case only those terms which may reasonably be expected to relate to that field will be printed out.

Examples of Russian words in such a category are:

    замок       castle
                lock
    замотать    twist
                shake

In the two examples above, ambiguity will disappear if the words are used in idiomatic context (e.g. padlock = висячий замок).

In the case of words containing е or ё, however, difficulties of multiple meaning that cannot be resolved by simple context (i.e., syntax) examination are very rare. In fact, in the author's experience, no example can readily be quoted.

Suggested Encoding Rules

From the above considerations, a set of rules can be formulated to include words containing ё and Ё. They are:

i) Source language words containing ё or Ё are stored in the dictionary in numerical form as if they contained е or Е in the corresponding letter positions.
ii) Incoming source language words are coded with a unique number code for every lower-case character except ё, which is treated as if it were е. All upper-case characters will have unique number codes corresponding to them (or they will be preceded by a coded upper-case symbol), except Ё, where the diacritic is ignored and the character is treated as if it were Е; й, ъ, ь and ы will have no upper-case code.
iii) If more than one target language alternative is found, the context of the Russian language word must be examined; this will also be required for any other word (not containing е or ё) where ambiguity may exist, as in the examples above.

The Problem of ъ

It may be noted that ъ could also be ignored completely since it occurs so very rarely in the Russian language. This may be of some importance since the character can be represented in several different ways, namely:

i) as ъ
ii) as '
iii) as a gap in a word
iv) it is ignored completely.

As in the above encoding rules, if ambiguity occurs because ъ is ignored, the context of the word must be examined. An example of words where this kind of difficulty can arise is

    сесть     sit down
    съесть    eat

In these cases, if a unique meaning cannot be found simply from the program, all the target-language equivalents will have to be printed out and the required meaning determined by post-editing.

From an examination of the occurrence of ё in the Russian language it seems that, if the diacritic is ignored, the chances of ambiguity occurring in MT, with the rules formulated above, are very slight. Indeed, for a specific subject, where all the source language words in the dictionary are known, most cases of ambiguity and difficulties of multiple meaning could be overcome by sufficiently sophisticated programming techniques (i.e., syntactical and idiomatic context examination for all the cases of expected ambiguity).

As to ъ, it may be ignored in the encoding. The few cases of ambiguity will be resolved from a study of context.

1. Harper, K.E., "The Mechanical Translation of Russian: Preliminary Report", Modern Language Forum, vol.38, no. 3-4, pp. 12-29, Sept.-Dec. 1953.
2. Oettinger, A.G., "The Design of an Automatic Russian-English Dictionary", Machine Translation of Languages, John Wiley and Sons, New York (1955), pp. 47-65.
3. Wall, R.E., "Some of the Engineering Aspects of the Machine Translation of Languages", AIEE Transactions, I, vol.75, 580 (1956).
4. Smirinskii, A.I., Russian-English Dictionary, State Publishing House for Foreign and National Dictionaries, Moscow (1952).
5. Yngve, V.H., "Syntax and the Problem of Multiple Meaning", Machine Translation of Languages, John Wiley and Sons, New York (1955), pp. 208-226.
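As a rough illustration of the suggested encoding rules, the following Python sketch folds ё and Ё into е and Е, ignores ъ, marks capitals with an upper-case shift symbol, and returns every stored alternative when a coded word is ambiguous. The code numbers, the choice of 0 for the shift symbol, the function names `encode`, `store` and `lookup`, and the tiny dictionary are all assumptions made for the example; the paper does not specify a particular code table. Under these assumptions the 31 remaining letters plus the shift symbol happen to fit in five bits, which is arithmetic of this sketch rather than a claim made in the paper.

```python
# Hypothetical sketch of the encoding rules; all numeric codes are assumed.
SHIFT = 0  # assumed code for the 'upper-case shift' symbol

# 31 letters: ё is folded into е and ъ is dropped entirely.
LETTERS = "абвгдежзийклмнопрстуфхцчшщыьэюя"
CODE = {ch: i for i, ch in enumerate(LETTERS, start=1)}

# Letters that never begin a word, so no upper-case form is coded.
NO_UPPER = set("йыь")


def encode(word):
    """Return the list of numeric codes for a Russian word."""
    codes = []
    for ch in word:
        if ch in "ъЪ":
            continue  # ъ is ignored in the encoding
        lower = ch.lower().replace("ё", "е")  # diacritic is disregarded
        if lower not in CODE:
            raise ValueError(f"no code for character {ch!r}")
        if ch.isupper() and lower not in NO_UPPER:
            codes.append(SHIFT)  # capital = shift symbol + lower-case code
        codes.append(CODE[lower])
    return codes


DICTIONARY = {}


def store(word, gloss):
    """Store a target-language alternative under the coded source word."""
    DICTIONARY.setdefault(tuple(encode(word)), []).append(gloss)


def lookup(word):
    """Return all alternatives; more than one means context must decide."""
    return DICTIONARY.get(tuple(encode(word)), [])


for source, target in [
    ("все", "all (plural)"),
    ("всё", "all (singular, neuter)"),
    ("села", "of the village"),
    ("села", "she sat"),
    ("сёла", "villages"),
    ("сесть", "sit down"),
    ("съесть", "eat"),
    ("ёлка", "fir tree"),
]:
    store(source, target)

print(encode("Ёлка"))    # shift symbol followed by the codes of "елка"
print(lookup("сёла"))    # all three alternatives stored for the coded form
print(lookup("съесть"))  # both "sit down" and "eat"
```

A lookup that returns more than one alternative corresponds to rule (iii): the program would then examine the context, or print out all the alternatives for post-editing.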