Towards Sindhi Corpus Construction

Mutee U Rahman
Department of Computer Science, Isra University, Hyderabad Sindh 71000, Pakistan
muteeurahman@gmail.com

Abstract

The paper discusses the current state of Sindhi corpus construction. Corpus development issues, including corpus acquisition, preprocessing, and tokenization, are discussed in detail. Preliminary results and observations are presented, including letter unigram, bigram and trigram frequencies, word frequencies, and word bigram frequencies. The current state of the Sindhi corpus, its limitations, and future work are also discussed. The paper also explores the orthography and script of the Sindhi language with reference to corpus development.

1. Introduction

Sindhi is one of the major languages of Pakistan, spoken by approximately 30-40 million people [1][2]. Sindhi is frequently used on the internet: Sindhi blogs, literary websites, online newspapers and discussion forums are increasing day by day. After Urdu, Sindhi is the second largest written language of Pakistan. Despite its online usage and popularity, only a few language processing resources are available to NLP researchers; these include lexicons, fonts and simple word processors. The development of Sindhi language processing resources such as linguistic corpora and a comprehensive computational lexicon has not even been initiated.

Sindhi is written in Perso-Arabic (سنڌي), Devanagari (सिन्धी) and Roman (sindhi) scripts. Perso-Arabic is the most common script for Sindhi writing in Pakistan and India. Devanagari is also used for Sindhi writing in India. Roman script, though not yet standardized, is gaining popularity. Very few written documents are available in Roman script, but it is used frequently for communication on the internet, cell phones and other smart devices. Because most online and offline written material in Sindhi is available in Perso-Arabic script, the Sindhi corpus under construction is in Perso-Arabic script using UTF-16 encoding.

The following sections discuss existing work on Pakistani language corpora, the orthography and script of the Sindhi language, corpus construction issues, corpus acquisition, preprocessing, tokenization, and the results of a preliminary statistical analysis. Finally, future work is discussed along with the conclusion.

2. Previous work

Apart from fonts, keyboard design [3] and a few digital dictionaries [4], Sindhi language processing resources are not publicly available. Studies or development projects for resources such as linguistic corpora and a comprehensive computational lexicon have not even been initiated. Various research organizations and individuals are working on linguistic corpora for other Pakistani languages. For Urdu, EMILLE [5], the Becker-Riaz corpus [6], the Jang newspaper corpus [7], and the parallel English, Urdu and Nepali corpus [8] are some key examples. For Pashto, projects include the BBN Byblos Pashto OCR System [9] and the machine readable Pashto text corpus being developed at the University of Peshawar [10]. The first Punjabi language corpus was developed by the Central Institute of Indian Languages (CIIL), India [11]. The Hindi Punjabi parallel corpus developed by CDAC Noida is another useful linguistic corpus. No such corpora can be found for Sindhi, Balochi, Siraiki and many other Pakistani languages. In contrast to other Pakistani languages (excluding Urdu), Sindhi text in electronic format is easily available and is being continuously collected for the corpus under discussion.

3. Orthography and script of Sindhi language

Sindhi is written in Perso-Arabic script based on an extended Arabic character set in Naskh style. The Sindhi alphabet comprises 52 letters, shown in Figure 1. The alphabet contains basic letters like ا، ب، ٻ and secondary letters like جھ and گھ, which are aspirated versions of ج and گ.

Figure 1. Sindhi alphabet.

Sindhi words always end in a vowel [12]; this vocalic ending is optionally marked by diacritics in written text. Diacritics are also used inside words to represent additional vocalic features. The absence of diacritics in written text sometimes causes semantic ambiguity. For instance, the words meaning "to push" and "bog" are both written دٻڻ when diacritics are omitted. Diacritics used in Sindhi are shown in Figure 2.

Figure 2. Diacritics used in Sindhi.

Sindhi has its own numerals based on Perso-Arabic numerals, shown in Figure 3. Use of Hindu-Arabic numerals is also very common in Sindhi writing. Special symbols shown in Figure 3 are also used in Sindhi written text.

Figure 3. Special symbols and numerals used in Sindhi written text.

4. Sindhi Corpus Development

Since the arrival of Unicode support and a Unicode based Sindhi keyboard design [13], the availability of Unicode based Sindhi text on the internet has been increasing day by day. The key motivation for Sindhi corpus construction is the availability of online text in Sindhi newspapers, blogs, literary websites and discussion forums. Although the available online resources do not yet provide a huge amount of text, they are growing steadily and the corpus is being collected continuously. Software routines for preprocessing, normalization, tokenization and frequency calculation are implemented in C# using Microsoft .NET Framework libraries.

4.1. Corpus Acquisition

Data is gathered from various domains, which include news, blogs, literature, essays, and letters. Subdomains include current affairs, sports, showbiz, short stories, discussions and opinions. Sources of data collection are shown in Table 1.

Table 1. Sources of data collection.
Source               URL(s)
Daily Kawish         http://www.thekawish.com
Daily Awami Awaz     http://www.awamiawaz.com
Daily Ibrat          http://dailyibrat.com
Blogs                http://shikarpuri.wordpress.com
Literary Writings    http://voiceofsindh.net
                     http://sindhsalamat.com

4.2. Preprocessing and Normalization

Almost all the data gathered was already in Unicode format; nevertheless, all collected text is converted into standard UTF-16 encoding. Letters represented by multiple Unicode code points, and equivalent composed and decomposed representations [14], are reduced to the same underlying form. Aspirated letters like گھ, which are combinations of two Unicode characters (گ and ھ in the case of گھ), are treated as single letters during text processing.

4.3. Tokenization

For tokenization, white space, punctuation marks, special symbols (like $, %, # etc.) and digits are used as word boundaries. Treating white space as a word boundary causes the problem of embedded space word breaking (for example, the single word صاحب قدرت is divided into the two words صاحب and قدرت); this is handled using the same technique used for Urdu [15]. Another problem in Sindhi word tokenization occurs when the two special words ۾ (in) and ۽ (and) are written without a space, as in ۾ملائڻ (mẽ: mila:iɳa), which was tokenized as a single word. Similarly, ڪتاب۽قلم, kita:ba ain qalama (book and pen), contains three words written without spaces and was tokenized as a single word. The same problem was observed with words that end in a non-connective letter, as in کيرپي, kʰi:ra pi:a (drink milk), or start with one, as in سنڌاندر, sindʰa ander (in Sindh). A semiautomatic (software based plus manual) approach was used to overcome this problem.

5. Results and observations

A corpus of 4.1 million words was analyzed quantitatively. This preliminary analysis includes letter frequency analysis, letter bigram analysis, letter trigram analysis, word frequency analysis, and word bigram analysis. The quantitative results are discussed in the following sections.

5.1. Letter frequencies

A total of 13,968,112 characters in the corpus were analyzed while calculating letter frequencies. Along with the 52 letters of the Sindhi alphabet, آ was also counted as a single letter because it is a single key on the Sindhi keyboard and has a single Unicode representation. The most frequent letter was the vowel ي, while the least frequent letter was the consonant ڱ. Table 2 shows the 20 most frequent letters in the Sindhi corpus with their percentages.

Table 2. Top 20 most frequent letters.
S.No.  Percent  Letter
1      13.77%   ي
2      11.42%   ا
3      8.99%    ن
4      7.84%    و
5      6.27%    ه
6      6.15%    ر
7      3.73%    م
8      3.64%    ج
9      3.30%    ل
10     3.26%    ت
11     3.25%    ڪ
12     3.23%    س
13     2.50%    د
14     2.00%    ب
15     1.80%    پ
16     1.18%    آ
17     1.16%    ڻ
18     1.16%    ک
19     0.99%    ع
20     0.94%    ٽ

Table 3. Top 20 bigrams in Sindhi corpus.
S.No.  Percent  Bigram
1      3.16%    ان
2      2.55%    جي
3      1.95%    ري
4      1.80%    ھن
5      1.79%    يو
6      1.79%    ار
7      1.79%    ھي
8      1.69%    ين
9      1.28%    نه
10     1.27%    ند
11     1.18%    ون
12     1.10%    يا
13     1.10%    آه
14     1.07%    جو
15     1.02%    وا
16     1.01%    لا
17     0.99%    کي
18     0.97%    ور
19     0.95%    لا
20     0.93%    تي

Table 4. Top 20 letter trigrams in Sindhi corpus.
S.No.  Percent  Trigram
1      1.40%    آھي
2      1.34%    نھن
3      0.81%    اري
4      0.74%    يون
5      0.71%    ڪري
6      0.61%    کان
7      0.60%    يند
8      0.53%    ندي
9      0.47%    وار
10     0.46%    مان
11     0.45%    ھنج
12     0.44%    آھن
13     0.44%    ڪيو
14     0.42%    انه
15     0.41%    ندو
16     0.40%    اٹي
17     0.36%    نجي
18     0.35%    ڏھن
19     0.35%    پنه
20     0.35%    دار

While analyzing frequencies, it was observed that the frequency distribution of individual letters in a single file of 50,000 or more words was identical to the letter frequency distribution of the whole corpus. This similarity can be seen in the graphs of Figures 4 and 5. Letter bigram and trigram frequencies were also analyzed.
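The letter-level counting described in sections 4.2 and 5.1 can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' C# routines. Several points are assumptions: the helper names `to_letters` and `ngram_percentages` are hypothetical, NFC is used as the "same underlying form", ھ is taken to be U+06BE, only ج and گ are listed as aspirated bases (the paper names only these two as examples), combining diacritics are skipped, and n-grams are counted within words only.

```python
import unicodedata
from collections import Counter

ASPIRATION = "\u06BE"  # ھ, assumed to be the second code point of aspirated letters

def to_letters(word, aspirated_bases):
    """Split a word into letters, merging base + ھ into one letter
    when the base is listed in aspirated_bases."""
    letters = []
    # NFC maps decomposed sequences (e.g. ا + madda) to the composed form (آ).
    for ch in unicodedata.normalize("NFC", word):
        if unicodedata.category(ch).startswith("M"):
            continue  # assumption: diacritics are not counted as letters
        if ch == ASPIRATION and letters and letters[-1] in aspirated_bases:
            letters[-1] += ch  # e.g. گ followed by ھ becomes the single letter گھ
        else:
            letters.append(ch)
    return letters

def ngram_percentages(words, n, aspirated_bases=frozenset("\u062C\u06AF")):
    """Return {n-gram: percent of all letter n-grams}, counted inside words."""
    counts = Counter()
    for word in words:
        letters = to_letters(word, aspirated_bases)
        for i in range(len(letters) - n + 1):
            counts["".join(letters[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: 100.0 * c / total for gram, c in counts.items()}
```

Calling `ngram_percentages(words, 1)`, `ngram_percentages(words, 2)` and `ngram_percentages(words, 3)` on a token list would give letter, bigram and trigram percentages in the style of Tables 2 to 4; the full set of aspirated bases would have to be supplied by the caller.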
It can be seen that almost 50% of the top 20 most frequent bigrams are valid two letter words like ان, جي, ھن and کي. The same holds for trigrams, where this ratio is more than 60%. The top 20 most frequent bigram and trigram percentages are shown in Tables 3 and 4 respectively.

Figure 4. Letter frequency distribution in Sindhi corpus.

Figure 5. Letter frequency distribution in a single file.

5.2. Word frequencies

A total of 4.1 million words were analyzed and 70,576 distinct word forms were found. The most frequently occurring words include case markers (like ۾, تي and کان) and auxiliary/incomplete verbs (like آھي and آھن). The postposition جي has the highest frequency of occurrence, as shown in Table 5.

Table 5. Top 20 most frequent words in Sindhi corpus.
S.No.  Percent  Word
1      3.71%    جي
2      2.44%    ۾
3      2.17%    ۽
4      1.78%    ته
5      1.61%    آھي
6      1.61%    کي
7      1.50%    جو
8      1.05%    تي
9      0.82%    به
10     0.71%    نه
11     0.69%    ڪري
12     0.69%    سان
13     0.67%    ان
14     0.63%    کان
15     0.57%    ٿي
16     0.55%    آھن
17     0.51%    لاءِ
18     0.50%    ھن
19     0.50%    ھو
20     0.46%    ڪيو

Word bigram occurrences were also calculated and are shown in Table 6. The proper name bigram بينظير ڀٽو is among the top 10 bigrams. This is because the current affairs domain contains essays and newspaper columns about the life of former prime minister Benazir Bhutto.

Table 6. Top 10 most frequent word bigrams.
S.No.  Percentage  Word bigram
1      7.52        چيو ته
2      6.75        آھي ته
3      2.66        ھن جي
4      1.93        بينظير ڀٽو
5      1.84        سنڌ جي
6      1.72        ان جي
7      1.60        جڏھن ته
8      1.60        ھن چيو
9      1.44        ڪيو ويو
10     1.21        ويو آھي

6. Future work

The corpus is being continuously collected and the results are being updated. Currently the corpus is simply a UTF-16 encoded text collection. Studies are in progress on proper annotation, POS tagging, corpus based lexicon development and n-gram based text categorization. A Sindhi tokenization algorithm needs to be worked out for the problems discussed in section 4.3. Because Sindhi has no standard sentence termination marker, the full stop, comma and other punctuation marks are all used as sentence terminators in Sindhi writing. Sentence segmentation is therefore another key area to be worked out. More specific Sindhi computational linguistic studies are needed for the further development and maturity of the corpus. For example, there is currently no comprehensive POS tagging algorithm available for Sindhi; the presently available POS tagging algorithm for Sindhi [16] needs to be analyzed and extended further. A Sindhi tagset needs to be designed before POS tagging of the corpus. Qualitative and quantitative improvements, proper annotation and comprehensive statistical analysis are areas to be worked on extensively.

7. Conclusion

In the absence of language processing resources for the Sindhi language, the Sindhi corpus construction project is a valuable initiative. Regardless of its size and the preliminary nature of the results, the corpus in its current state will provide a basis for further natural language processing studies of the Sindhi language. Letter frequencies, including bigram and trigram frequencies, provide a basis for intelligent text processing and compact keyboard design for cell phones and other smart devices. Word level unigram and bigram frequencies provide a basis for spelling correction and automatic sentence completion applications. Further development of the corpus will be useful for advanced language processing tasks like morphological analysis, syntax analysis, semantic analysis, information retrieval and extraction, and machine translation.

8. References

[1] Sindhi Language Authority. Official Website. http://www.sindhila.org. (Accessed 2010).
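As a rough illustration of the word boundary rule from section 4.3 and the word counts from section 5.2, the following Python sketch tokenizes text and tallies word and word bigram frequencies. It is not the authors' C# implementation or their semiautomatic procedure. The names `tokenize`, `word_stats` and `percentages` are hypothetical, and the sketch makes several assumptions: any character that is not a Unicode letter or combining mark is treated as a boundary, the special words ۾ (U+06FE) and ۽ (U+06FD) are always emitted as separate tokens even when attached to a neighbour, and bigrams are formed from consecutive tokens. It does not handle embedded spaces or words joined after a non-connective letter.

```python
import unicodedata
from collections import Counter

SPECIAL_WORDS = {"\u06FE", "\u06FD"}  # ۾ (in) and ۽ (and)

def tokenize(text):
    """Split text on whitespace, punctuation, symbols and digits,
    keeping ۾ and ۽ as standalone tokens."""
    tokens, current = [], []

    def flush():
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in unicodedata.normalize("NFC", text):
        if ch in SPECIAL_WORDS:
            flush()
            tokens.append(ch)
        elif unicodedata.category(ch)[0] in "LM":  # letters and diacritics
            current.append(ch)
        else:  # whitespace, punctuation, symbols, digits, controls
            flush()
    flush()
    return tokens

def word_stats(text):
    """Return (word counts, word bigram counts) for the given text."""
    tokens = tokenize(text)
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def percentages(counter):
    """Convert raw counts to percentages of the total."""
    total = sum(counter.values())
    return {key: 100.0 * value / total for key, value in counter.items()}
```

With this rule the string ڪتاب۽قلم yields three tokens, matching the intended segmentation given in section 4.3. Joined words such as کيرپي would still come out as one token, since splitting them needs a lexicon or manual review.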