© 2018 JETIR July 2018, Volume 5, Issue 7 www.jetir.org (ISSN-2349-5162)

Classification of Sentences for Paraphrasing in Punjabi Language

Ravinder Mohan Jindal
Research Scholar
Sant Baba Bhag Singh University, Jalandhar

Vijay Rana
Assistant Professor
Department of Computer Science and Applications
Sant Baba Bhag Singh University, Jalandhar

ABSTRACT
In this research article, the author develops an algorithm to classify Punjabi sentences into simple, compound and complex sentences. This classification is done to assist in generating paraphrases of Punjabi sentences. The author classifies Punjabi sentences on the basis of sentence length and morphological features such as the presence of a non-finite verb or the presence of specific postpositions after the root form of a verb. After applying the proposed algorithm, the author obtained precision of 100%, recall of 99.4% and F-measure of 99.69% for simple sentences; precision of 100%, recall of 99.15% and F-measure of 99.57% for compound sentences; and precision of 100%, recall of 99.99% and F-measure of 99.94% for complex sentences.

Keywords- Paraphrasing, sentence simplification, Punjabi sentences.

INTRODUCTION
Research in the field of language processing is growing rapidly. Most automatic language processing systems have been developed for English, and comparatively little work has been done for Indian languages. One such task is to convert an existing sentence into a different form while keeping its meaning the same, which is helpful in converting a complex sentence into a simpler one. In Natural Language Processing, the technical term for this task is paraphrasing. Paraphrasing plays an important role in day-to-day life: when we read a newspaper, check email or follow instructions, we interact with text, and it is important to understand that text. If some of these texts are complex, they must be simplified before they can be understood.
Paraphrasing is a technique for modifying natural language sentences so that their complexity is reduced and their readability and understandability are improved. The goal of paraphrasing is to reduce the syntactic complexity of a long sentence so as to help in the development and improvement of various natural language processing tools. Paraphrasing can help in developing and improving many natural language processing applications, such as text summarization, machine translation, grammar checking and assistive technology. In all of these applications, sentence simplification is used as a pre-processor.

PUNJABI SENTENCES
Like those of other languages, sentences in Punjabi fall into four categories: simple, compound, complex and compound-complex. Each of these four categories has its own characteristics that help in classifying sentences. In this research the author has used two main characteristics of these sentences: the length of the sentence and the morphological features of conjunctions (for identification of complex sentences). The main features of simple, compound and complex sentences that are used for classification are listed in Table 1.

JETIR1807185 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org

Table 1: Features of simple, compound and complex sentences
Type of sentence | Length of sentence | Presence of dependent clause | Number of verb phrases | Number of clauses
Simple sentence | Less than or equal to seven | No dependent clause | One verb phrase | One
Compound sentence | More than seven | No dependent clause | At least two verb phrases | At least two
Complex sentence | More than seven | Yes, contains dependent clause | At least two verb phrases | At least two

EXISTING WORK
Various techniques have been proposed by various authors for different languages. Bautista, S.
et al. (2017) [1] explained a technique for processing numerical information present in long, complex sentences. Lee, J. et al. (2017) [2] applied parsing technology to the syntactic simplification of English sentences. Narayan, S. et al. (2017) [3] proposed a split-and-rephrase technique for simplifying complex sentences in English. Bingel, J. et al. (2016) [4] presented a model based on Conditional Random Fields over dependency structures for text simplification and paraphrasing. Sethi, N. et al. (2016) [5] discussed an approach for reframing Hindi sentences to generate paraphrases. Narayan, S. et al. (2015) [6] presented an unsupervised technique for sentence simplification based on deep semantics. Saini, S. et al. (2015) [7] proposed a relative-clause-based sentence simplification method to facilitate an English-Hindi machine translation system. Cental, I. et al. (2014) [8] discussed a corpus-based approach to syntactically simplifying complex French text. Štajner, S. et al. (2013) [9] explained the process of automatic simplification of complex texts in Spanish. Collados, J. C. (2013) [10] used a sentence simplification approach to create a simple Spanish corpus; the author used syntactic simplification split rules, covering coordination and subordination, to split long sentences. Wubben, S. et al. (2010) [11] presented a technique for generating paraphrases using a monolingual corpus. Petersen, S. E. et al. (2007) [12] performed a detailed analysis of a corpus of news articles and their abridged versions written by a literacy organization. Bannard, C. et al. (2005) [13] proposed a method for paraphrasing using bilingual parallel corpora. Inui, K. et al. (2003) [14] described an ongoing research project on text simplification for Japanese. Knight, K.
et al. (2002) [15] presented corpus-based sentence compression algorithms using noisy-channel and decision-tree approaches. A rule-based sentence simplification approach was proposed by Naushad UzZaman et al. (2011) [21], a system for identifying complex predicates in Hindi was developed by Ankit Soni et al. (2005) [22], and a clause boundary identification system for Urdu was developed by Daraksha Parveen et al. (2009) [23].

PROPOSED MODEL
As shown in Figure 1, the input sentence is first checked for its length. The length of a sentence is simply the number of words it contains, excluding the sentence ender. The author analyzed 500 compound and complex sentences and observed that more than 97% of the compound sentences are longer than seven words. Hence the length of the sentence is taken as the first criterion for classification. If the length of the sentence is less than or equal to seven, the number of verb phrases is checked; if the sentence contains only one verb phrase, it is a simple sentence. If, on the other hand, the length of the sentence is greater than seven, the sentence is a candidate compound or complex sentence and is further checked for the presence of a dependent clause. According to B.S. Cheema [24], there are four basic types of dependent clause in Punjabi: relative clauses, KI clauses, adverb clauses and non-finite clauses. Each of these types has specific morphological features: relative clauses start with the conjunctions ਜਜ and ਜਜਜਜਜ, KI clauses start with the subordinating conjunction ਜਜ, adverb clauses start with ਜਜ, ਜਜਜਜ, ਜਜਜਜਜ, ਜਜਜਜਜ, ਜਜਜਜਜ, ਜਜ, ਜਜਜ ਜਜਜਜਜਜ etc.,
and non-finite clauses contain non-finite verbs such as ਜਜਜਜਜਜਜ, ਜਜਜਜਜ and ਜਜਜਜਜਜ, i.e. verbs that carry ਜਜ, ਜਜਜਜ or ਜ as a suffix.

Algorithm used:
Step 1: Enter the Punjabi corpus.
Step 2: For each sentence in the corpus, calculate its length. If the length of the sentence is less than or equal to 7, go to Step 3; otherwise go to Step 4.
Step 3: Check the number of verb phrases present in the sentence. If there is only one verb phrase, it is a simple sentence; otherwise go to Step 4.
Step 4: Check for the presence of a dependent clause using the dependent clause features. If a dependent clause is present, it is a complex sentence; otherwise go to Step 5.
Step 5: Check the number of verb phrases present in the sentence. If there is only one verb phrase, it is a simple sentence; otherwise it is a compound sentence.

Figure 1: Proposed flow chart of sentence classification (input Punjabi sentence; calculate length of sentence; decision "Sentence length < 7"; decision "contains one verb phrase"; check for presence of dependent clause; decision "contains more than one verb phrase"; outcomes: simple sentence, compound sentence, complex sentence)

CHALLENGES
Two main challenges arise in developing the sentence classification system. First, because Punjabi phrase structure is recursive in nature, it is difficult to identify dependent clause boundaries [18][19][20]. Second, there are various types of complex sentence in Punjabi [16][17][24], and these vary in structure, so a separate morphological feature has to be used to identify each type of complex sentence.

RESULTS AND DISCUSSION
As discussed above, the two main parameters used for classification of Punjabi sentences are length and morphological features. To test the proposed algorithm, 15000 sentences were collected from online sources.
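The five steps listed under "Algorithm used" can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation. The verb-phrase count is passed in as an argument because the text does not say how verb phrases are detected. The names `classify`, `sentence_length`, `has_dependent_clause`, `SENTENCE_ENDERS` and `DEPENDENT_MARKERS` are hypothetical, and the marker set holds only a few Punjabi conjunctions assumed for illustration. Non-finite clauses, which are identified by verb suffixes, are not covered.

```python
# Assumed set of sentence enders (the danda plus common punctuation).
SENTENCE_ENDERS = {"।", "?", "!", "."}

# Illustrative dependent-clause markers only; an assumption for this sketch,
# not the complete feature list used in the paper.
DEPENDENT_MARKERS = {"ਕਿ", "ਜੋ", "ਜਿਹੜਾ", "ਜੇ", "ਜਦੋਂ"}


def sentence_length(tokens):
    """Number of words in the sentence, excluding the sentence ender."""
    return sum(1 for token in tokens if token not in SENTENCE_ENDERS)


def has_dependent_clause(tokens):
    """True if any token is one of the assumed dependent-clause markers."""
    return any(token in DEPENDENT_MARKERS for token in tokens)


def classify(tokens, verb_phrase_count):
    """Label a tokenized sentence as 'simple', 'compound' or 'complex'."""
    # Steps 2 and 3: a short sentence with a single verb phrase is simple.
    if sentence_length(tokens) <= 7 and verb_phrase_count == 1:
        return "simple"
    # Step 4: any dependent clause makes the sentence complex.
    if has_dependent_clause(tokens):
        return "complex"
    # Step 5: without a dependent clause, the verb-phrase count decides.
    return "simple" if verb_phrase_count == 1 else "compound"
```

Following the listed steps literally, a short sentence with more than one verb phrase is not rejected at Step 3 but falls through to Steps 4 and 5, so it can still be labelled complex or compound.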
Out of the 15000 collected sentences, 2000 simple sentences and 1500 each of compound and complex sentences were used to check the effect of length on the classification of sentences. The results obtained are shown in Table 2. The classification results obtained after applying the complete proposed algorithm are shown in Tables 3.a, 3.b, 3.c and 3.d.

Table 2: Results obtained by classifying the sentences on the basis of their length
Type of sentence | Total number of sentences | Number of sentences having length <7 | %age of sentences having length <7 | Number of sentences having length >7 | %age of sentences having length >7
Simple | 2000 | 1991 | 99.55 | 9 | 0.45
Compound | 1500 | 0 | 0 | 1500 | 100
Complex | 1500 | 5 | 0.4 | 1495 | 99.6

Table 3.a: Results obtained by applying the proposed algorithm to three datasets
Test set number | Number of sentences in the set | Number of simple sentences in the corpus | Number of compound sentences in the corpus | Number of complex sentences in the corpus | Correctly classified simple sentences by proposed system | Correctly classified compound sentences by proposed system | Correctly classified complex sentences by proposed system
1 | 5000 | 904 | 2890 | 1206 | 899 | 2855 | 1205
2 | 5000 | 658 | 1700 | 2642 | 655 | 1689 | 2642
3 | 5000 | 552 | 1890 | 2558 | 548 | 1881 | 2555
Total | 15000 | 2114 | 6480 | 6406 | 2102 | 6425 | 6402
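The totals in Table 3.a are enough to recompute per-class recall, taking recall as the number of correctly classified sentences of a type divided by the number of sentences of that type in the corpus. The sketch below does this in Python. The names `TOTALS`, `recall` and `f_measure` are hypothetical, and precision is supplied as an input because Table 3.a does not list how many sentences were wrongly assigned to each class.

```python
# Totals row of Table 3.a: (correctly classified, present in the corpus).
TOTALS = {
    "simple": (2102, 2114),
    "compound": (6425, 6480),
    "complex": (6402, 6406),
}


def recall(correct, actual):
    """Fraction of the sentences of one type that were labelled correctly."""
    return correct / actual


def f_measure(precision, recall_value):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall_value / (precision + recall_value)


if __name__ == "__main__":
    for kind, (correct, actual) in TOTALS.items():
        r = recall(correct, actual)
        # Precision of 1.0 is the value reported in the abstract; it is an
        # input here, not something derived from Table 3.a.
        f = f_measure(1.0, r)
        print(f"{kind}: recall = {100 * r:.2f}%, F-measure = {100 * f:.2f}%")
```

Because the figures are recomputed from the totals row, they need not agree in every digit with the values quoted in the abstract.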