[Mechanical Translation, vol. 4, no. 3, December 1957; pp. 76-78]

A Refinement in Coding the Russian Cyrillic Alphabet
B. Zacharov, London University, London, England

By reducing the number of characters to be coded, the problem of devising a numerical code for the Cyrillic alphabet can be simplified. This reduction can be achieved by providing code-words for only the lower-case forms of characters that do not occur initially, by disregarding the diacritic of the character ё, and by disregarding the character ъ entirely. Ambiguities that arise in the latter cases can be resolved by an examination of the context.

THE PROBLEM of coding the Russian Cyrillic alphabet in numerical form has been considered previously in several papers [1], and it is clear that it would be desirable if each character of the Russian alphabet (together with any required numbers, punctuation marks and capitals) could be coded in such a way that a separate unique numerical code-word existed for each lower-case character, capital, etc. Unfortunately, the speed of modern digital computers and the size of their memories are such that a code of this form would result in considerable time being spent in the memory search for the appropriate target language equivalent.

It is clear, then, that ways must be found, apart from engineering advances, to speed up the memory search time. One way of doing this would be to decrease the amount of linguistic data stored in the memory, and this has been considered [2]. Another method would be to decrease the amount of numerical data (i.e., the number of bits) in the memory for a given number of source language characters. This last approach has been considered in a recent paper on mechanical translation [3], where all the lower-case characters except ё, й, ъ and ь are represented by a five binary-digit code, while all the capitals and decimal numbers use a ten-bit code. In the code proposed in that paper, simplification is obtained on the basis of the statement that "... five of the 33 Russian letters never start a word and will not need to be capitalized ...". The five Russian letters referred to are ё, й, ъ, ь, ы.

All the other Russian characters occur frequently in both upper and lower case. They must be coded either separately in both these forms, or by the same numerical code with the upper case always preceded by some number which denotes an 'upper-case shift'.

Inspection of the statement quoted above reveals that it is formally incorrect with respect to ё, although it is quite correct to state that none of the four characters й, ъ, ь and ы ever begins a word in the Russian language, so that it will never be necessary for them to be coded in upper-case form. (A rigorously phonetic transliteration of some other alphabet into Russian may create a trivial exception in the cases of й and ы. This will not be considered here.)

1. Harper, K. E., "The Mechanical Translation of Russian: Preliminary Report", Modern Language Forum, vol. 38, no. 3-4, pp. 12-29, Sept.-Dec. 1953.
2. Oettinger, A. G., "The Design of an Automatic Russian-English Dictionary", Machine Translation of Languages, John Wiley and Sons, New York (1955), pp. 47-65.
3. Wall, R. E., "Some of the Engineering Aspects of the Machine Translation of Languages", AIEE Transactions, I, vol. 75, 580 (1956).

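The two coding options described above (separate code-words for capitals versus one code-word per letter plus an 'upper-case shift') can be compared with a minimal Python sketch. The code-word numbers, the SHIFT value and the function names are assumptions chosen for illustration; they are not the assignments used in any of the cited papers.

```python
# Illustrative sketch only: all code-word values here are arbitrary assumptions.
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # the 33 lower-case letters
NEVER_INITIAL = set("йъьы")  # stated in the text never to begin a Russian word
SHIFT = 0  # hypothetical 'upper-case shift' code-word

# One code-word per lower-case letter, numbered from 1 (an assumption).
CODE = {ch: i for i, ch in enumerate(LOWER, start=1)}


def encode(word):
    """Encode a word, prefixing each capital with the shift code-word."""
    out = []
    for ch in word:
        low = ch.lower()
        if ch != low:  # the character is a capital
            if low in NEVER_INITIAL:
                raise ValueError(f"no upper-case code is provided for {ch!r}")
            out.append(SHIFT)
        out.append(CODE[low])
    return out


def code_words_separate():
    """Code-words needed if every permitted capital gets its own code-word."""
    return len(LOWER) + sum(1 for ch in LOWER if ch not in NEVER_INITIAL)


def code_words_with_shift():
    """Code-words needed if capitals are written as shift + lower-case code."""
    return len(LOWER) + 1


print(encode("Ёлка"))
print(code_words_separate(), code_words_with_shift())
```

Under these assumptions the shift scheme needs far fewer distinct code-words, at the cost of one extra code-word in the text for every capital. Digits and punctuation are left out of the count.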
The Problem of ё

Reference to a Russian-English dictionary [4] shows us that many words of the Russian language begin with ё. Notable examples are ёлка 'fir tree' and ёмкость 'capacity'; the latter is of especial importance in scientific texts.

Superficially, therefore, it would appear that ё should be treated in the same way as the other word-initial characters and that it should be coded in upper and lower case. However, the following points must be considered:
i) In practice, ё is never written in script form with the diacritic, either in lower or upper case; е and Е are used.
ii) A modern standard Russian typewriter keyboard does not contain Ё or ё; the upper and lower case forms of е are used, as in (i).
iii) Both ё and Ё frequently appear in print, especially in the texts of scientific periodicals.

Thus, from (i), (ii) and (iii) above, it can be seen that the problem of encoding ё and Ё is complicated by the source of the Russian language text. If е and ё are coded separately, it would appear that words containing ё would have to be stored in the memory in two separate locations, with both е and ё in the corresponding positions of each word.

a) ё at the beginning of a word

For words with ё at the beginning, any coding difficulty can be overcome if it is noted that, if the diacritic is ignored, no ambiguity can arise. This is because no two words with different meanings exist in the Russian language such that corresponding letters of both words are the same except that ё at the beginning of the first word is replaced by е in the second word. As a result of this consideration it will never be necessary to encode ё in capitalized form; the upper-case form of е will be sufficient.

b) ё in any letter position

If ё occurs in some letter position other than at the beginning of some word (x), ambiguity can arise only if another word (y) exists such that all the letters of the (y)-word are the same as the corresponding letters of the (x)-word except that ё in (x) is replaced by е in (y). Examination of a Russian-English dictionary reveals that this does not occur often in the stem of a word. Similarly, experience tells us that ambiguity seldom arises as a result of word endings together with stem.

Examples of words where ambiguity may occur are:

    все     all (plural)
    всё     all (singular, neuter)

    села    of the village (genitive, singular)
    села    she sat
    сёла    villages (nominative/accusative, pl.)

Whereas discrepancy need not necessarily occur in the first example, considerable ambiguity can arise in the second case since the words are different grammatical forms of widely different words (сёла is a plural noun while села may be a verb form or a singular noun).

However, we note that if the contexts of these words are examined, most cases of ambiguity disappear (this is especially true for Russian, where strict grammatical rules concerning case endings and conjugation must be observed). Indeed, such an examination is essential for certain words in Russian and, more especially, in English [5].

Certain Russian words are such that their spelling is associated with multiple meaning and, here, it is often the case that an examination of the context will not reveal which alternative is meant. In this event it becomes necessary to print out all the alternatives stored in the computer memory which correspond to the source word. At this stage a simplification may be effected if the computer dictionary is concerned only with a certain field (e.g., nuclear physics), in which case only those terms which may reasonably be expected to relate to that field will be printed out.

Examples of Russian words in such a category are:

    замок       castle
                lock

    замотать    twist
                shake

4. Smirinskii, A. I., Russian-English Dictionary, State Publishing House for Foreign and National Dictionaries, Moscow (1952).
5. Yngve, V. H., "Syntax and the Problem of Multiple Meaning", Machine Translation of Languages, John Wiley and Sons, New York (1955), pp. 208-226.

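The condition in (b) can be checked mechanically over a word list: two words become ambiguous only if they are identical once ё is replaced by е. The following Python sketch shows one way to find such pairs. The function names and the small word list are assumptions for illustration, and the list contains only words quoted in the text.

```python
from collections import defaultdict


def fold_yo(word):
    """Disregard the diacritic: treat ё as е and Ё as Е."""
    return word.replace("ё", "е").replace("Ё", "Е")


def yo_collisions(words):
    """Group spellings that become identical once the diacritic is ignored."""
    groups = defaultdict(set)
    for word in words:
        groups[fold_yo(word)].add(word)
    # Only groups with more than one distinct spelling are ambiguous.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical sample list built from the words quoted in the text.
WORDS = ["все", "всё", "села", "сёла", "ёлка", "ёмкость", "замок"]

for folded, spellings in yo_collisions(WORDS).items():
    print(folded, spellings)
```

On this sample only the pairs все/всё and села/сёла collide. Words such as ёлка and ёмкость fold to a spelling that no other listed word shares, which matches the argument in (a).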
In the two examples above, ambiguity will disappear if the words are used in idiomatic context (e.g. padlock = висячий замок).

In the case of words containing е or ё, however, difficulties of multiple meaning that cannot be resolved by simple context (i.e., syntax) examination are very rare. In fact, in the author's experience, no example can readily be quoted.

Suggested Encoding Rules

From the above considerations, a set of rules can be formulated to include words containing ё and Ё. They are:
i) Source language words containing ё or Ё are stored in the dictionary in numerical form as if they contained е or Е in the corresponding letter positions.
ii) Incoming source language words are coded with a unique number code for every lower-case character except ё, which is treated as if it were е. All upper-case characters will have unique number codes corresponding to them (or they will be preceded by a coded upper-case symbol), except Ё, where the diacritic is ignored and the character is treated as if it were Е; й, ъ, ь and ы will have no upper-case code.
iii) If more than one target language alternative is found, the context of the Russian language word must be examined; this will also be required for any other word (not containing е or ё) where ambiguity may exist, as in the examples above.

The Problem of ъ

It may be noted that ъ could also be ignored completely since it occurs so very rarely in the Russian language. This may be of some importance since the character can be represented in several different ways, namely:
i) as ъ
ii) as '
iii) as a gap in a word
iv) it is ignored completely.

As in the above encoding rules, if ambiguity occurs because ъ is ignored, the context of the word must be examined. An example of words where this kind of difficulty can arise is

    сесть     =  sit down
    съесть    =  eat

In these cases, if a unique meaning cannot be found simply from the program, all the target-language equivalents will have to be printed out and the required meaning determined by post-editing.

From an examination of the occurrence of ё in the Russian language it seems that, if the diacritic is ignored, the chances of ambiguity occurring in MT, with the rules formulated above, are very slight. Indeed, for a specific subject, where all the source language words in the dictionary are known, most cases of ambiguity and difficulties of multiple meaning could be overcome by sufficiently sophisticated programming techniques (i.e., syntactical and idiomatic context examination for all the cases of expected ambiguity).

As to ъ, it may be ignored in the encoding. The few cases of ambiguity will be resolved from a study of context.
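
The encoding rules (i) to (iii), together with the treatment of ъ, can be illustrated by a small dictionary lookup in Python. This is a sketch under stated assumptions, not the author's program: the function names and the sample entries are hypothetical, folding to lower case before lookup is an added simplification, and the "gap in a word" representation of ъ is not handled.

```python
from collections import defaultdict


def normalise(word):
    """Apply rules (i) and (ii): ignore the diacritic of ё and drop ъ."""
    word = word.replace("ё", "е").replace("Ё", "Е")
    # ъ may also arrive written as an apostrophe; both forms are dropped.
    for mark in ("ъ", "Ъ", "'"):
        word = word.replace(mark, "")
    # Assumption: case is folded so that capitalised words find their entry.
    return word.lower()


def build_dictionary(entries):
    """Store every source word under its normalised spelling (rule i)."""
    table = defaultdict(list)
    for source, target in entries:
        table[normalise(source)].append(target)
    return dict(table)


def lookup(table, word):
    """Return all target-language alternatives for an incoming word."""
    return table.get(normalise(word), [])


# Hypothetical entries; the glosses are those quoted in the text.
ENTRIES = [
    ("сёла", "villages"),
    ("села", "of the village"),
    ("села", "she sat"),
    ("сесть", "sit down"),
    ("съесть", "eat"),
    ("ёмкость", "capacity"),
]

table = build_dictionary(ENTRIES)
for word in ("Емкость", "села", "с'есть"):
    print(word, lookup(table, word))
```

When lookup returns more than one alternative, rule (iii) applies: the context must be examined, or all the alternatives printed out for post-editing. That step is outside the scope of this sketch.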