Source: www.cle.org.pk


Towards Sindhi Corpus Construction

Mutee U Rahman
Department of Computer Science, Isra University, Hyderabad Sindh 71000, Pakistan
muteeurahman@gmail.com

Abstract

The paper discusses the current state of Sindhi corpus construction. Corpus development issues, including corpus acquisition, preprocessing, and tokenization, are discussed in detail. Preliminary results and observations, which include letter unigram, bigram and trigram frequencies, word frequencies and word bigram frequencies, are presented. The current state of the Sindhi corpus, with its limitations and future work, is also discussed. The paper also explores the orthography and script of the Sindhi language with reference to corpus development.

1. Introduction

Sindhi is one of the major languages of Pakistan, spoken by approximately 30-40 million people [1][2]. Sindhi is frequently used on the internet: Sindhi blogs, literary websites, online newspapers and discussion forums are increasing day by day. After Urdu, Sindhi is the second largest written language of Pakistan. Despite its online usage and popularity, only a few language processing resources are available for NLP researchers; these include a lexicon, fonts and simple word processors. The development of Sindhi language processing resources such as linguistic corpora and a comprehensive computational lexicon has not even been initiated.

Sindhi is written in Perso-Arabic (سنڌي), Devanagari (सिन्धी) and Roman (sindhi) scripts. Perso-Arabic is the most common script for Sindhi writing in Pakistan and India. Devanagari script is also used for Sindhi writing in India. Roman script, though not yet standardized, is also gaining popularity. Very few written documents are available in Roman script, but it is frequently used for communication on the internet, cell phones and other smart devices. Because most online and offline written Sindhi material is available in Perso-Arabic script, the Sindhi corpus being constructed is in Perso-Arabic script using UTF-16 encoding.

The following sections discuss the existing work on Pakistani language corpora, the orthography and script of the Sindhi language, corpus construction issues, corpus acquisition, preprocessing, tokenization and the results of preliminary statistical analysis. Finally, future work is discussed along with the conclusion.

2. Previous work

Apart from fonts, keyboard design [3] and a few digital dictionaries [4], Sindhi language processing resources are not publicly available. Studies or development projects for resources such as linguistic corpora and a comprehensive computational lexicon have not even been initiated. Various research organizations and individuals are working on the development of linguistic corpora for different Pakistani languages. For Urdu, EMILLE [5], the Baker Riaz corpus [6], the Jang newspaper corpus [7], and the parallel English, Urdu and Nepali corpus [8] are some key examples. For Pashto, the projects include the BBN Byblos Pashto OCR System [9] and the machine-readable Pashto text corpus being developed at the University of Peshawar [10]. The first Punjabi language corpus was developed by the Central Institute of Indian Languages (CIIL), India [11]. The Hindi and Punjabi parallel corpus developed by CDAC Noida is another useful linguistic corpus. No such linguistic corpora can be found for Sindhi, Balouchi, Siraiki and many other Pakistani languages. In contrast to other Pakistani languages (excluding Urdu), Sindhi text in electronic format is easily available and is being continuously collected for the corpus under discussion.

3. Orthography and script of Sindhi language

Sindhi is written in Perso-Arabic script based on an extended Arabic character set in Naskh style. Sindhi
alphabet comprises 52 letters, shown in Figure 1. The alphabet contains basic letters like ا، ب، ٻ and secondary letters like جھ and گھ, which are aspirated versions of ج and گ.

Figure 1. Sindhi alphabet.

Sindhi words always end in a vowel [12]; this vocalic ending is optionally marked by diacritics in written text. Diacritics are also used inside words to represent additional vocalic features. The absence of diacritics in written text sometimes causes semantic ambiguities. For instance, the words دَٻڻ (to push) and دُٻڻ (bog) are semantically ambiguous without diacritics. Diacritics used in Sindhi are shown in Figure 2.

Figure 2. Diacritics used in Sindhi.

Sindhi has its own numerals, based on Perso-Arabic numerals, shown in Figure 3. Use of Hindu-Arabic numerals is also very common in Sindhi writing. The special symbols shown in Figure 3 are also used in Sindhi written text.

Figure 3. Special symbols and numerals used in Sindhi written text.

4. Sindhi Corpus Development

Since the arrival of Unicode support and a Unicode-based Sindhi keyboard design [13], the availability of Unicode-based Sindhi text on the internet has been increasing day by day. The key motivation for Sindhi corpus construction is the availability of online text in Sindhi newspapers, blogs, literary websites and discussion forums. Although the available online resources do not provide a huge amount of text, they are growing day by day and the corpus is being collected continuously. Software routines for preprocessing, normalization, tokenization and frequency calculation are implemented in C# using Microsoft .NET Framework libraries.

4.1. Corpus Acquisition

Data is gathered from various domains, which include news, blogs, literature, essays, and letters. Subdomains include current affairs, sports, showbiz, short stories, discussions and opinions. Sources of data collection are shown in Table 1.

Table 1. Sources of data collection.

Source               URL(s)
Daily Kawish         http://www.thekawish.com
Daily Awami Awaz     http://www.awamiawaz.com
Daily Ibrat          http://dailyibrat.com
Blogs                http://shikarpuri.wordpress.com
Literary Writings    http://voiceofsindh.net
                     http://sindhsalamat.com

4.2. Preprocessing and Normalization

Almost all the data gathered was already in Unicode format; nevertheless, all the collected text is converted into standard UTF-16 encoding. Letters represented by multiple Unicode code points, and equivalent composed and decomposed forms [14], are reduced to the same underlying form. Aspirated letters like گھ, which are combinations of two Unicode characters (گ and ھ in the case of گھ), are treated as single letters during text processing.
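The normalization steps of Section 4.2 can be illustrated with a minimal sketch. The paper's routines are written in C#; Python is used here only for illustration, and this is not the author's implementation. The entries in VARIANT_MAP and the set of aspirated base letters are assumptions chosen for the example, since the paper does not list the code points it merges.

```python
import unicodedata

# Hypothetical variant map (an assumption, not taken from the paper):
# look-alike code points folded into one canonical form.
VARIANT_MAP = {
    "\u06CC": "\u064A",  # FARSI YEH -> ARABIC YEH
    "\u0643": "\u06AA",  # ARABIC KAF -> SWASH KAF
}
_TRANSLATE = str.maketrans(VARIANT_MAP)

HEH_DOACHASHMEE = "\u06BE"               # ھ, marks aspiration
ASPIRATED_BASES = {"\u062C", "\u06AF"}   # ج and گ (the examples named in the paper)


def normalize(text):
    """Compose decomposed sequences (NFC), then fold variant code points."""
    return unicodedata.normalize("NFC", text).translate(_TRANSLATE)


def letters(word):
    """Split a word into letters, keeping base + ھ digraphs as one letter."""
    out = []
    for ch in word:
        if ch == HEH_DOACHASHMEE and out and out[-1] in ASPIRATED_BASES:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def to_utf16(text):
    """Encode normalized text as UTF-16 (little-endian, no byte order mark)."""
    return text.encode("utf-16-le")
```

For example, normalize("\u0627\u0653") composes alef followed by a combining madda into the single code point آ (U+0622), and letters() applied to گھر yields two letters rather than three.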
                                                                                  
4.3. Tokenization

For tokenization, white spaces, punctuation markers, special symbols (like $, %, # etc.) and digits are used as word boundaries. Treating white space as a word boundary caused the problem of embedded-space word breaking (for example, the single word صاحب قدرت is divided into the two words صاحب and قدرت); this is handled using the same technique used for Urdu [15]. Another problem in Sindhi word tokenization occurs when the two special words ۾ (in) and ۽ (and) occur without a space, as in ۾ملائڻ (me:} mila:i®a), which was tokenized as a single word. Similarly, ڪتاب۽قلم kita:ba ain qalama (book and pen) contains three words without spaces and was tokenized as a single word. The same problem was observed with all words with a non-connective ending, like کيرپي kʰi:ra pi:a (drink milk), or starting letters, like سنڌاندر sindʰa ander (in Sindh). A semiautomatic (software-based plus manual) approach was used to overcome this problem.

5. Results and observations

A corpus of 4.1 million words was analyzed quantitatively. This preliminary analysis includes letter frequency analysis, letter bigram analysis, letter trigram analysis, word frequency analysis, and word bigram analysis. These quantitative results are discussed in the following sections.

5.1. Letter frequencies

A total of 13,968,112 characters in the corpus were analyzed while calculating letter frequencies. Along with the 52 letters of the Sindhi alphabet, آ was also counted as a single letter because the Sindhi keyboard treats it as a single letter and it has a single Unicode representation. The most frequent letter was the vowel ي, while the least frequent letter was the consonant ڱ. Table 2 shows the 20 most frequent letters in the Sindhi corpus with their percentages.

While analyzing frequencies, it was observed that the frequency distribution of individual letters in a single file of 50,000 or more words was identical to the letter frequency distribution of the whole corpus. This similarity can be seen in the graphs of Figures 4 and 5.

Letter bigram and trigram frequencies were also analyzed. Almost 50% of the top 20 most frequent bigrams are valid two-letter words, like ان, جي, ھن and کي. The same holds for trigrams, where this ratio is more than 60%. The top 20 bigram and trigram percentages are shown in Tables 3 and 4 respectively.

Table 2. Top 20 most frequent letters.

S.No.   Letter   Percent      S.No.   Letter   Percent
1       ي        13.77%       11      ڪ        3.25%
2       ا        11.42%       12      س        3.23%
3       ن        8.99%        13      د        2.50%
4       و        7.84%        14      ب        2.00%
5       ه        6.27%        15      پ        1.80%
6       ر        6.15%        16      آ        1.18%
7       م        3.73%        17      ڻ        1.16%
8       ج        3.64%        18      ک        1.16%
9       ل        3.30%        19      ع        0.99%
10      ت        3.26%        20      ٽ        0.94%

Table 3. Top 20 bigrams in Sindhi corpus.

S.No.   Bigram   Percent      S.No.   Bigram   Percent
1       ان       3.16%        11      ون       1.18%
2       جي       2.55%        12      يا       1.10%
3       ري       1.95%        13      آه       1.10%
4       ھن       1.80%        14      جو       1.07%
5       يو       1.79%        15      وا       1.02%
6       ار       1.79%        16      لا       1.01%
7       ھي       1.79%        17      کي       0.99%
8       ين       1.69%        18      ور       0.97%
9       نه       1.28%        19      لا       0.95%
10      ند       1.27%        20      تي       0.93%

Table 4. Top 20 letter trigrams in Sindhi corpus.

S.No.   Trigram   Percent     S.No.   Trigram   Percent
1       آھي       1.40%       11      ھنج       0.45%
2       نھن       1.34%       12      آھن       0.44%
3       اري       0.81%       13      ڪيو       0.44%
4       يون       0.74%       14      انه       0.42%
5       ڪري       0.71%       15      ندو       0.41%
6       کان       0.61%       16      اٹي       0.40%
7       يند       0.60%       17      نجي       0.36%
8       ندي       0.53%       18      ڏھن       0.35%
9       وار       0.47%       19      پنه       0.35%
10      مان       0.46%       20      دار       0.35%

Figure 4. Letter frequency distribution in Sindhi corpus.
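The letter unigram, bigram and trigram percentages of Section 5.1 can be computed along the following lines. This is a minimal Python sketch under stated assumptions, not the paper's C# routine: n-grams are counted inside words only (the paper does not say whether they may span word boundaries), and every code point is treated as one letter, so aspirated digraphs are not merged here. The function names are hypothetical.

```python
from collections import Counter


def letter_ngrams(words, n):
    """Count letter n-grams inside each word (assumption: none span words)."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def percentages(counts):
    """Convert raw counts into percentages of all counted n-grams."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 100.0 * c / total for gram, c in counts.items()}


def top(counts, k=20):
    """Return the k most frequent n-grams with their percentages."""
    pct = percentages(counts)
    return [(gram, pct[gram]) for gram, _ in counts.most_common(k)]
```

Calling top(letter_ngrams(words, 2)) on a tokenized corpus would produce a ranking in the same shape as Table 3; the figures themselves depend on the corpus and on the counting assumptions above.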
                                                                                      
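The word-boundary rules of Section 4.3 can be sketched as follows. This Python example is illustrative only and is not the author's C# implementation. It assumes that letters and combining marks form words, that everything else (white space, punctuation, symbols, digits) is a boundary, and that the special words ۽ (U+06FD) and ۾ (U+06FE) are always emitted as separate tokens. It does not address embedded-space words or run-together words such as کيرپي, for which the paper reports a semiautomatic approach.

```python
import unicodedata

SINDHI_AND = "\u06FD"  # ۽ (and)
SINDHI_IN = "\u06FE"   # ۾ (in)
STANDALONE = {SINDHI_AND, SINDHI_IN}


def tokenize(text):
    """Split text into words; ۽ and ۾ always become tokens of their own."""
    tokens = []
    current = []

    def flush():
        # Emit the word collected so far, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch in STANDALONE:
            flush()
            tokens.append(ch)
        elif unicodedata.category(ch)[0] in ("L", "M"):
            # Letters and combining marks (diacritics) belong to a word.
            current.append(ch)
        else:
            # White space, punctuation, symbols and digits end the word.
            flush()
    flush()
    return tokens
```

With this rule, the run-together string ڪتاب۽قلم from Section 4.3 is split into three tokens (ڪتاب, ۽, قلم) instead of one.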
Figure 5. Letter frequency distribution in a single file.

5.2. Word frequencies

A total of 4.1 million words were analyzed and 70,576 distinct word forms were found. The most frequent words include case markers (like ۾, تي and کان) and auxiliary/incomplete verbs (like آھي and آھن). The postposition جي has the highest frequency of occurrence, as shown in Table 5.

Table 5. Top 20 most frequent words in Sindhi corpus.

S.No.   Word    Percent      S.No.   Word    Percent
1       جي      3.71%        11      ڪري     0.69%
2       ۾       2.44%        12      سان     0.69%
3       ۽       2.17%        13      ان      0.67%
4       ته      1.78%        14      کان     0.63%
5       آھي     1.61%        15      ٿي      0.57%
6       کي      1.61%        16      آھن     0.55%
7       جو      1.50%        17      لاءِ     0.51%
8       تي      1.05%        18      ھن      0.50%
9       به      0.82%        19      ھو      0.50%
10      نه      0.71%        20      ڪيو     0.46%

Word bigram occurrences are also calculated and are shown in Table 6. The proper-name bigram بينظير ڀٽو is among the top 10 bigrams. This is because the current affairs domain contains essays and newspaper columns about the life of former prime minister Benazir Bhutto.

Table 6. Top 10 most frequent word bigrams.

S.No.   Word bigram     Percentage
1       چيو ته          7.52
2       آھي ته          6.75
3       ھن جي           2.66
4       بينظير ڀٽو      1.93
5       سنڌ جي          1.84
6       ان جي           1.72
7       جڏھن ته         1.60
8       ھن چيو          1.60
9       ڪيو ويو         1.44
10      ويو آھي         1.21

6. Future work

The corpus is being continuously collected and results are being updated. Currently, the corpus is simply a UTF-16 encoded text collection. Studies are in progress on proper annotation, POS tagging, corpus-based lexicon development and n-gram based text categorization. The Sindhi tokenization algorithm needs to be worked out for the problems discussed in Section 4.3. Owing to the absence of a standard sentence-termination punctuation marker in Sindhi, the full stop, comma and other punctuation markers are used as sentence terminators in Sindhi writing. Sentence segmentation is another key area to be worked out. More specific Sindhi computational linguistic studies are needed for further development and maturity of the corpus. For example, there is currently no comprehensive POS tagging algorithm available for Sindhi; the presently available POS tagging algorithm for Sindhi [16] needs to be analyzed and extended further. A Sindhi tagset needs to be designed before POS tagging of the corpus. Qualitative and quantitative improvements, proper annotations and comprehensive statistical analysis are areas still to be worked out extensively.

7. Conclusion

In the absence of language processing resources for the Sindhi language, the Sindhi corpus construction project is a valuable initiative. Regardless of its size and preliminary results, the corpus in its current state will provide a basis for further natural language processing studies of the Sindhi language. Letter frequencies, including bigram and trigram frequencies, provide a basis for intelligent text processing and compact keyboard design for cell phones and other smart devices. Word-level unigram and bigram frequencies provide a basis for spelling correction and automatic sentence completion applications. Further development of the corpus will be useful for advanced language processing tasks like morphological analysis, syntax analysis, semantic analysis, information retrieval and extraction, and machine translation.

8. References

[1] Sindhi Language Authority. Official Website. http://www.sindhila.org. (Accessed 2010).