A Transliteration Keyboard Configuration with Tamil Unicode Characters

M.A.C.M. Raafi* and H. M. Nasir#
*Department of Mathematical Sciences, South Eastern University of Sri Lanka. E-mail: raafim@seu.ac.lk
#Department of Mathematics, University of Peradeniya, Sri Lanka. E-mail: nasirh@pdn.ac.lk

5th IEEE International Conference on Advanced Computing & Communication Technologies [ICACCT-2011], ISBN 81-87885-03-3

Abstract
Keyboard configurations for typing are available for many languages and for data processing tasks. The most common keyboard in use today is the QWERTY keyboard, whose layout is designed for typing the English alphabet and numerals. Typing in other languages requires configurations that remap the QWERTY keys to fit those languages. Such configurations often run into difficulties because these languages have much larger character sets than English. To address this issue, a transliteration keyboard configuration is to be considered. Transliteration is a method by which one can read a text of one language in the writing system of another language. In this paper, we discuss the development of a phonetic transliteration keyboard configuration for the Tamil language using Unicode encodings.

Introduction
Input devices are used to enter data and commands into a computer system for data processing work. One of the most commonly used input devices is the keyboard, which consists of letters, numerals and other special characters.

Different types of keyboard systems are available in the computing environment. The standard keyboard is known as the QWERTY keyboard. It is designed to type the letters of the English language and related symbols. For other languages, such as Asian languages, the QWERTY keyboard is inconvenient.

Entering the characters of these Asian languages with the QWERTY keyboard is impossible without a suitable configuration that maps the English keys of the keyboard. Even with such a configuration mapping, typing the letters of the language is difficult, because one has to memorize or be familiar with the keyboard mapping in the configuration.

Despite these limitations, transliteration is to be considered for typing texts, to the benefit of end users. Transliteration is the process by which one reads and pronounces the words and sentences of one language using the letters and special symbols of another language. It is helpful in situations where one does not know the script of a language but knows how to speak and understand the language [1].

For example, one of the Asian languages, Tamil, can be introduced to English-literate Tamils and non-Tamils with a transliteration scheme. There are 247 characters in Tamil: 12 vowels, 18 consonants, 216 compound alphabets and one Aayitha character. In Tamil word-processors the large number of compound alphabets is obtained by sequential keying of the corresponding consonant and vowel. For example, the keystroke for the consonant k (க்) followed by the vowel i produces the compound character ki (கி).

Keyboard layouts of this kind have been called "phonetic". Tamil transliteration is a phonetic keyboard system. Thus, in the transliteration program the Tamil word for father (அப்பா) is written as appA (or appaa), and mother (அம்மா) as ammA (or ammaa).

Our transliteration system normally offers the following advantages.
1. The keystrokes are user-friendly; users type in a more familiar way.
2. There is no need to memorize the whole keystroke mapping of the keyboard.
3. A person coming from another language can type easily.
4. There is no need to change the font each time in order to type special characters and symbols such as / , : < > | ) ( * & ^ % $ # @ ! ~ + ? etc.
5. By introducing Unicode:
   a. The text can be displayed everywhere.
   b. The language does not matter.
6. The font does not matter.
7. Wrongly formed words are corrected.

Encoding Systems
An encoding scheme is a necessary part of the configuration of a keyboard layout for the transliteration program. The encoding is the system by which the characters in a set are represented in binary form in a file. In computers and in data transmission between them, i.e. in digital data processing and transfer, data is as a rule represented internally as octets. Octets are often called bytes, but in principle octet is a more definite concept than byte. An octet consists of eight bits [6].

Tamil Character encodings
In Tamil, the forms of some letters differ from one another for the same vowel sound. This is the reason for the high number of letters included in the Tamil keyboards designed so far. Tamil is a language where, in addition to the basic vowels (uyir) and consonants (mei), the compound (uyirmei) characters all have unique glyph forms. Some popular Tamil font encoding schemes are TSCII, TAM, TAB, ISCII and Unicode.

TSCII
The first and most popular one is the Tamil Standard Code for Information Interchange (TSCII), a glyph-based, 8-bit bilingual encoding. It uses a unique set of glyphs: the usual lower ASCII set (Roman letters with standard punctuation marks) occupies the first 128 slots, and the Tamil glyphs occupy the upper segment, slots 128-255.

TAM and TAB
TAM is a monolingual encoding scheme (TAmil Monolingual), whereas TAB is a bilingual encoding scheme (TAmil Bilingual). They were proposed by the Tamil Nadu Government. TAM is of limited use in an OS environment.

ISCII
The Indian Standard Code for Information Interchange (ISCII) is an 8-bit, single-byte umbrella standard, defined in such a way that all Indian languages can be treated using one single character encoding scheme. ISCII is a bilingual character encoding scheme (not a glyph-based one). Roman characters and punctuation marks as defined in standard lower ASCII take up the first half of the character set (the first 128 slots). Characters for Indic languages are allocated to the upper slots (128-255) [5].

Unicode
Unicode is an international standard for multi-lingual word-processing. As described in [6], it is a two-byte encoding scheme which covers the world's common writing systems: it represents each character as a 2-byte number, from 0 to 65535, and each such number represents a unique character used in at least one of the world's languages. (Later versions of the standard extend beyond this 16-bit range.) There is exactly one number per character, and exactly one character per number. It provides over 65,000 slots to handle more than 50 of the world's languages simultaneously. Along with other Asian languages, Tamil has been assigned the slots from U+0B80 to U+0BFF (in decimal, 2944 to 3071; 128 locations) in this multi-lingual standard [6].

Unicode encodes only the basic vowel and consonant characters and a set of modifiers to represent situations where a vowel/consonant pair appears as a combination (uyirmei) in the Tamil language. A Unicode file stores textual information solely at this "character" level. It does not care about the actual form of the glyphs. Rendering of the glyphs corresponding to the stored characters is left to software.

Once we get beyond the ASCII world, there are many different native encodings for different languages and operating systems. Converting between all of these is easiest with a central "common point", and that is Unicode.

Technically, Unicode is used wherever the characters used are all drawn from the Unicode set; in other words, just about everywhere. Systems that use ASCII are also using Unicode, since Unicode contains the ASCII set and gives its characters the same code points they had in ASCII [6].

Unicode Code Charts
The code charts present the characters of the Unicode Standard. Characters are organized into related groups called blocks. In the Unicode Standard, character blocks generally contain characters from a single script. In many cases, a script is fully represented in its character block. There are, however, important exceptions, most notably in the area of punctuation characters.

Literature Review
Transliteration of Asian language input is a subject of recent research. During the past several years, different methods have been introduced to prepare Indian language documents by entering the text through specific transliteration schemes. Data entry through transliteration is quite close to a phonetic mapping of Indian language characters to the letters of the Roman alphabet.

The earliest and most widely used transliteration scheme is the Library of Congress Transliteration Scheme. It uses Roman alphabets with diacritics (horizontal bars or circles added above or below Roman alphabets) to represent the alphabets of Indian languages. Diacritical markers added to a letter or symbol show its pronunciation, accent, etc., typically indicating that a phonetic value is different from the unmarked state. The scheme is very general in scope and hence can be used for almost all world languages. Established Tamil research centers all around the world are aware of this scheme and most of them implement it as such, without modifications [5].

ADAMI was one of the early Tamil word-processors for MS-DOS PCs, produced by Dr. K. Srinivasan of Canada in the early eighties and released in 1984, to recast such transliterated text into Tamil. The Tamil text is typed using a plain ASCII transliteration scheme. Upon compiling and executing the linked macro, the romanized text page is recast on screen in equivalent Tamil. One needs to return to the romanized text mode to make corrections, if any. A more recent version of this software, called THIRU, uses a split screen in which the Roman text being typed in the bottom half of the screen is continuously recast in Tamil in the upper half. ADHAWIN is another recent implementation of the same software for Windows-based PCs [5].

The Murasu and Anjal word-processing packages are widely used in Malaysian, Singaporean and Tamil newspapers and magazines. These packages belong to the group of "romanized input and interpreted output" tools. The 'inaimathi' and related font faces used in these packages are of the 8-bit bilingual type. The first 128 slots (0-127) are filled by Roman characters as in basic ASCII, and the Tamil characters occupy the upper ASCII slots (128-255). By invoking the keyboard editor it is possible to access either of these two blocks. In the Tamil typing mode, the Roman keystrokes and their relative sequence are continuously interpreted to present the equivalent Tamil characters on screen. Thus we can type 'kathai' to get the equivalent Tamil word 'கதை' [8].

Keyboard Configuration Program
A number of tools and programming languages can be used to develop transliteration keyboard configuration software, such as Keyman, C, C++ and Java. In our work we use Keyman as the keyboard configuration program. Keyman is a keyboard management utility that makes it practical to input many different languages. It fully supports Unicode and allows us to create our own keyboard layouts. It interprets and translates input from the computer keyboard according to a set of rules called a keyboard. These rules are stored in a keyboard file. It includes features such as an on-screen keyboard and phonetic and visual-order input methods.

Keyman includes full support for Unicode. It supports input and output of any of the thousands of characters defined in Unicode. There are two applications included in Keyman Developer: TIKE and KMComp. TIKE, the Tavultesoft Integrated Keyboard Editor, is a complete environment for designing, developing, testing, and packaging our keyboards for distribution.

KMComp, the command-line compiler, is a simple tool that lets us compile keyboards, packages, and installers from the command line. This is useful if we want to use batch builds or Make files.

Keyboard File
The keyboard file is the most important component in a keyboard configuration. It contains the set of rules that represent the particular keyboard. To create a new keyboard, we need to create a keyboard file. There are two ways to create a keyboard file:

The Keyboard Wizard
It gives us a simple interface to quickly create a keyboard using a visual representation of a computer keyboard. We can drag and drop characters from a character map, and create ANSI and Unicode keyboard layouts. We cannot access most of the program's more powerful features from the Keyboard Wizard, but it is useful for getting started on a design. We can convert keyboards created in the Keyboard Wizard to standard program source files in TIKE.

The Keyboard Language
It provides the flexibility that is needed to write keyboards with complex character management, including constraints, dead keys, post-entry parsing, virtual key management (accessing any key on the keyboard), and other features.

A keyboard file is divided into two sections: the header and the rules. The header section defines the name of the keyboard, its bitmap, and other general settings. The rules define how the keyboard responds to keystrokes from the user, and are divided into groups.

The keyboard header is the first part of a keyboard; it consists of statements that help Keyman identify the keyboard and set default options for it. Each statement in the header must be on a separate line and is usually written in capital letters. The body of the keyboard is the other most important part: it determines the behavior of the keyboard. The body consists of groups, which in turn contain one or more rules that define the responses of the keyboard to certain keystrokes.

Methodology
The keyboard program interprets and translates input from the computer keyboard according to a set of rules called a keyboard. Transliteration of Tamil has to make Tamil usable on the same 26-letter keyboard that serves the English language. The plan of our work is to develop simple methods for using Tamil on the computer and to introduce Tamil through transliteration.

The Tamil language has 247 characters: 12 vowels (uyir) and one aayitham letter, 18 consonants (mei), and 216 compound characters (uyirmei) derived from these. Tamil is one of the Indian languages in which many of the compound (uyirmei) alphabets have a complex geometric structure (glyph) of their own.

The mei characters are formed by adding the pulli sign ( ◌் ) to the consonant. The uyirmei letters are created by combining the 12 uyir letters with the 18 mei characters (12 x 18 = 216).

There are also 13 numeral characters used in Tamil. These numerals are not much used today, but they were used in earlier times. They represent the following values:

0 1 2 3 4 5 6 7 8 9 10 100 1000

Choosing the mapping for characters
We define the output characters to be produced by the keyboard. We select the appropriate keystrokes from the QWERTY keyboard to map to the output characters. Some keystrokes are used to represent output characters while others are not. The keystrokes that do not represent any output are called dead keys. Dead keys produce null output.

Analyzing the Keystrokes and Assigning Keystrokes
We need to analyze how to create all the Tamil characters using this limited number of codes. Some characters have their own Unicode numbers and can be assigned directly, while other characters do not. The latter have to be represented by combining two or more other Unicode characters. A key or a sequence of keystrokes is assigned to a particular character or combination of characters to represent the Tamil characters. One or more keystrokes can be used to represent a character.

The 247 letters in the Tamil alphabet are the product of 31 basic Tamil letters. 18 English letters have a similar sound connection with 18 Tamil letters. Only the 13 remaining Tamil letters need a 'sound connection' with English. We can make this 'sound connection', that is, devise the new connections, by allocating letters that are already in use, either in combination or singly, as follows:

    + "a" > U+0B85            அ
    U+0B85 + "a" > U+0B86     ஆ
    + "A" > U+0B86            ஆ
    + "i" > U+0B87            இ
    U+0B87 + "i" > U+0B88     ஈ
    + "I" > U+0B88            ஈ

Conclusion
The use of the Tamil language in computers enters a new era with the emergence of the Unicode standard and its support on more modern platforms and applications. These days, most Tamil websites support Unicode, and typography-related techniques are also switching to the new standard.

This paper is useful to people who are interested in developing their own transliteration software to type words and sentences for their word processing work and to build World Wide Web applications easily using the QWERTY keyboard. This study also provides solutions for some existing problems with Tamil typography.

Many non-Unicode Tamil fonts with stylish glyphs are available at present. Using such fonts in documents can give a great appearance. However, because the keyboard mappings of these fonts are unfamiliar, they are not widely used for typing Tamil. It is possible to map these stylish fonts to a familiar keyboard configuration, with the support of a keyboard configuration environment. Then we can use them with our keyboard configuration.

It is also possible to extend this keyboard configuration to other platforms such as Linux, Mac OS, Solaris, etc., as these already support Unicode. The only thing to be done is to set up a keyboard layout in each operating system's native format.

Appendix
Some typing examples.

    naan or nAn        நான்
    avan               அவன்
    manithan           மனிதன்
    paadasaalai        பாடசாலை
    paLkaLaikazakam    பல்கலைகழகம்

References
[1] Acharya, Multilingual Computing for Literacy and Education, SDL, IIT Madras, India, http://acharya.iitm.ac.in/acharya.html, 2005.
[2] Addison-Wesley Pub Co, The Unicode Standard 3.0 (www.unicode.org), 1998.
[3] Elengo, Tamil 99 Keyboard Layout, www.cadgraf.com, 2000.
[4] Ilakkuvanar, S., Tholkappiyam in English.
[5] Kalyanasundaram, K., An Overview Of Different Tools For Word-Processing Of Tamil And A Proposal Towards Standardisation, Institute of Physical Chemistry, Swiss Federal Inst. of Technology, 1997.
[6] Mark Pilgrim, "Python and Unicode", http://diveintophython.org/toc/index.html.
[7] Muguntharaj, Tamil-TSCIIANJAL, 1998.
[8] Muthu Nedumaran, Murasu Anjal, 2000.
[9] Ramalingam Shanmugalingam, jAzhan, Transliteration of Tamil to English for the Information Technology, 2002.
[10] Samaranayake, V. K., Nandasara, S. T., Dissanayake, J. B., Weerasinghe, A. R., Wijayawardhana, H., An Introduction to UNICODE for Sinhala Characters, University of Colombo School of Computing, 2003.
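
Supplementary sketch. The six vowel rules listed under the keystroke assignment (+ "a" > U+0B85, U+0B85 + "a" > U+0B86, and so on) are context-sensitive rewrites: the text already produced plus the new keystroke decide the output. The Python sketch below imitates that behaviour outside Keyman. It is an illustration under stated assumptions, not the authors' keyboard file: only the six vowel rules come from the paper, while the consonant table (k, p, m, v), the vowel-sign rules, and the names build_rules and transliterate are hypothetical choices made for the demonstration.

```python
# Illustrative sketch only; not the keyboard file described in the paper.
# Each rule maps (context, key) -> output, where "context" is the tail of the
# text produced so far and "output" replaces that tail.

PULLI = "\u0BCD"       # Tamil pulli (virama), marks a pure consonant (mei)
SIGN_AA = "\u0BBE"     # vowel sign aa
SIGN_I = "\u0BBF"      # vowel sign i
SIGN_II = "\u0BC0"     # vowel sign ii

# Assumed consonant keys for the demo (hypothetical subset).
CONSONANTS = {"k": "\u0B95", "p": "\u0BAA", "m": "\u0BAE", "v": "\u0BB5"}


def build_rules():
    """Return the rule table: the paper's six vowel rules plus assumed ones."""
    rules = {
        ("", "a"): "\u0B85",        # + "a" > U+0B85
        ("\u0B85", "a"): "\u0B86",  # U+0B85 + "a" > U+0B86
        ("", "A"): "\u0B86",        # + "A" > U+0B86
        ("", "i"): "\u0B87",        # + "i" > U+0B87
        ("\u0B87", "i"): "\u0B88",  # U+0B87 + "i" > U+0B88
        ("", "I"): "\u0B88",        # + "I" > U+0B88
    }
    # Assumed rules: a consonant key yields consonant + pulli; a following
    # vowel key replaces the pulli with the matching vowel sign.
    for key, cons in CONSONANTS.items():
        mei = cons + PULLI
        rules[("", key)] = mei
        rules[(mei, "a")] = cons                     # inherent vowel a
        rules[(cons, "a")] = cons + SIGN_AA          # "aa" typed as a, a
        rules[(mei, "A")] = cons + SIGN_AA
        rules[(mei, "i")] = cons + SIGN_I
        rules[(cons + SIGN_I, "i")] = cons + SIGN_II  # "ii" typed as i, i
        rules[(mei, "I")] = cons + SIGN_II
    return rules


def transliterate(keys, rules=None):
    """Apply the rules keystroke by keystroke, longest context first."""
    rules = build_rules() if rules is None else rules
    max_ctx = max(len(ctx) for ctx, _ in rules)
    out = ""
    for key in keys:
        for n in range(min(max_ctx, len(out)), -1, -1):
            ctx = out[len(out) - n:]
            if (ctx, key) in rules:
                out = out[:len(out) - n] + rules[(ctx, key)]
                break
        else:
            out += key  # keys without a rule pass through unchanged
    return out


if __name__ == "__main__":
    for word in ("appA", "appaa", "ammA", "ki"):
        print(word, transliterate(word))
```

With these assumed tables, transliterate("appA") and transliterate("appaa") both give அப்பா, and transliterate("ki") gives கி, matching the examples in the Introduction. Trying the longest context first is one simple way to let a rule such as U+0B85 + "a" take precedence over the context-free rule + "a"; a real keyboard file would cover all 31 basic letters.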