© 2018 JETIR July 2018, Volume 5, Issue 7 www.jetir.org (ISSN-2349-5162)
Classification of Sentences for Paraphrasing in
Punjabi Language

Ravinder Mohan Jindal
Research Scholar
Sant Baba Bhag Singh University, Jalandhar

Vijay Rana
Assistant Professor
Department of Computer Science and Applications
Sant Baba Bhag Singh University, Jalandhar

ABSTRACT

In this research article, the authors develop an algorithm to classify Punjabi sentences into simple, compound and complex
sentences. This classification is done to assist in generating paraphrases of Punjabi sentences. The sentences are classified
on the basis of sentence length and morphological features such as the presence of a non-finite verb and the presence of
specific postpositions after the root form of the verb. Applying the proposed algorithm yielded a precision of 100%, recall of
99.4% and F-measure of 99.69% for simple sentences; a precision of 100%, recall of 99.15% and F-measure of 99.57% for
compound sentences; and a precision of 100%, recall of 99.99% and F-measure of 99.94% for complex sentences.

Keywords: Paraphrasing, sentence simplification, Punjabi sentences.

INTRODUCTION

Research in the field of language processing is growing rapidly. Most automatic language processing systems have been
developed for English, and comparatively little work has been done for Indian languages. One such task is to convert an
existing sentence into a different form while keeping its meaning the same, which is helpful for converting a complex sentence
into a simpler one. In Natural Language Processing, the technical term for this task is paraphrasing. Paraphrasing plays a very
important role in day-to-day life: when we read a newspaper, check email or follow instructions, we interact with text, and it
is important to understand that text. If some of these texts are complex, they must be simplified in order to be understood.
Paraphrasing is a technique for modifying natural language sentences so that their complexity is reduced and their readability
and understandability are improved. The goal of paraphrasing is to reduce the syntactic complexity of a long sentence so as to
help in the development and improvement of various natural language processing tools. Paraphrasing can help in developing and
improving applications such as text summarization, machine translation, grammar checking and assistive technology. In all of
these applications, sentence simplification is used as a pre-processor.

PUNJABI SENTENCES

Like those of other languages, sentences in Punjabi fall into four categories: simple, compound, complex and compound-complex.
Each of these categories has its own characteristics that help in classifying sentences. This work uses two main
characteristics of these sentences: the length of the sentence and the morphological features of conjunctions (for
identification of complex sentences). The main features of simple, compound and complex sentences used for classification are
listed in Table 1.

JETIR1807185 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 196

Table 1: Features of simple, compound and complex sentences

| Type of sentence | Length of sentence | Presence of dependent clause | Number of verb phrases | Number of clauses |
|---|---|---|---|---|
| Simple sentence | Less than or equal to seven | No dependent clause | One verb phrase | One |
| Compound sentence | More than seven | No dependent clause | At least two verb phrases | At least two |
| Complex sentence | More than seven | Yes, contains dependent clause | At least two verb phrases | At least two |

EXISTING WORK

Various techniques have been proposed for different languages. Bautista, S. et al. (2017) [1] explained a technique to process
numerical information present in large complex sentences. Lee, J. et al. (2017) [2] applied parsing technology for syntactic
simplification of English sentences. Narayan, S. et al. (2017) [3] proposed a split-and-rephrase technique for simplification
of complex sentences in English. Bingel, J. et al. (2016) [4] presented a Conditional Random Field model over dependency
structures for text simplification and paraphrasing. Sethi, N. et al. (2016) [5] discussed an approach for reframing Hindi
sentences to generate paraphrases. Narayan, S. et al. (2015) [6] presented an unsupervised technique for sentence
simplification based on deep semantics. Saini, S. et al. (2015) [7] proposed a relative-clause-based sentence simplification
method to facilitate an English-Hindi machine translation system. Cental, I. et al. (2014) [8] discussed a corpus-based
approach to syntactically simplify complex French text. Štajner, S. et al. (2013) [9] explained the process of automatic
simplification of complex texts in Spanish. Collados, J. C. (2013) [10] used a sentence simplification approach to create a
simple Spanish corpus, using syntactic simplification split rules, coordination and subordination to split long sentences.
Wubben, S. et al. (2010) [11] presented a technique for generating paraphrases using a monolingual corpus. Petersen, S. E.
et al. (2007) [12] performed a detailed analysis of a corpus of news articles and abridged versions written by a literacy
organization. Bannard, C. et al. (2005) [13] proposed a method for paraphrasing using bilingual parallel corpora. Inui, K.
et al. (2003) [14] described an ongoing research project on text simplification for Japanese. Knight, K. et al. (2002) [15]
presented corpus-based sentence compression algorithms using noisy-channel and decision-tree approaches. A rule-based sentence
simplification approach was proposed by Naushad UzZaman et al. (2011) [21], identification of complex predicates in Hindi was
addressed by Ankit Soni et al. (2005) [22], and a clause boundary identification system for Urdu was developed by Daraksha
Parveen et al. (2009) [23].

PROPOSED MODEL
As shown in Figure 1, the input sentence is first checked for its length. The length of a sentence is simply the number of
words in it, excluding the sentence ender. The authors analyzed 500 compound and complex sentences and observed that more than
97% of the compound sentences have a length of more than seven. Hence, sentence length is used as the first criterion for
classification. If the length of the sentence is less than or equal to seven, the number of verb phrases is checked; if the
sentence contains only one verb phrase, it is a simple sentence, otherwise it proceeds to the dependent clause check. If, on
the other hand, the length of the sentence is greater than seven, the sentence is a candidate for the compound and complex
classes. Such a sentence is then checked for the presence of a dependent clause. According to B. S. Cheema [24], there are
basically four types of dependent clause in Punjabi: relative clauses, KI clauses, adverb clauses and non-finite clauses. Each
of these types has specific morphological features: relative clauses start with the conjunctions ਜਜ and ਜਜਜਜਜ, KI clauses
start with the subordinating conjunction ਕਿ, adverb clauses start with ਜਜ, ਜਜਜਜ, ਜਜਜਜਜ,
ਜਜਜਜਜ, ਜਜਜਜਜ, ਜਜ, ਜਜਜ ਜਜਜਜਜਜ etc., and non-finite clauses contain non-finite verbs such as ਜਜਜਜਜਜਜ, ਜਜਜਜਜ, ਜਜਜਜਜਜ, i.e.
verbs that carry ਜਜ ਜਜਜਜ and ਜ as suffixes.
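
The morphological check described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name `has_dependent_clause`, its parameters and the romanized marker values are all assumptions, and the actual Gurmukhi conjunctions and suffixes would have to be supplied by the caller. It also simplifies the description by testing every token, rather than locating clause boundaries or restricting the suffix test to verbs.

```python
from typing import Iterable, Sequence


def has_dependent_clause(
    tokens: Sequence[str],
    clause_markers: Iterable[str],
    nonfinite_suffixes: Iterable[str],
) -> bool:
    """Return True if any token is a clause marker or carries a non-finite suffix."""
    markers = set(clause_markers)
    suffixes = tuple(nonfinite_suffixes)
    # Relative, KI and adverb clauses: look for a marker word.
    # (Simplification: the text says these clauses *start* with the marker;
    # clause boundaries are not located here.)
    if any(token in markers for token in tokens):
        return True
    # Non-finite clauses: look for a word ending in a non-finite suffix.
    # (Simplification: every token is tested, not only verbs.)
    return any(token.endswith(suffixes) for token in tokens)


# Hypothetical romanized placeholders, NOT the paper's actual marker lists.
MARKERS = {"ki", "jo"}
SUFFIXES = ("ke",)

print(has_dependent_clause(["w1", "w2", "ki", "w3"], MARKERS, SUFFIXES))  # True
print(has_dependent_clause(["w1", "w2"], MARKERS, SUFFIXES))  # False
```

Passing the marker sets as arguments keeps the lookup independent of any particular word list, so a separate list can be maintained for each type of dependent clause.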
Algorithm used:
Step 1: Enter the Punjabi corpus.
Step 2: For each sentence in the corpus, calculate its length.
If the length of the sentence is less than or equal to 7, go to Step 3; otherwise go to Step 4.
Step 3: Check the number of verb phrases present in the sentence.
If there is only one verb phrase, it is a simple sentence; otherwise go to Step 4.
Step 4: Check for the presence of a dependent clause using the dependent clause features.
If a dependent clause is present, it is a complex sentence; otherwise go to Step 5.
Step 5: Check the number of verb phrases present in the sentence.
If there is only one verb phrase, it is a simple sentence; otherwise it is a compound sentence.
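
The decision steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the names `SentenceFeatures`, `sentence_length`, `classify` and `SENTENCE_ENDERS` are hypothetical, the set of sentence enders (the danda plus `?` and `!`) is an assumption, and the verb phrase count and dependent clause flag are taken as already computed by some upstream analysis that the text does not specify.

```python
from dataclasses import dataclass

MAX_SIMPLE_LENGTH = 7  # length threshold used in Step 2
SENTENCE_ENDERS = "।?!"  # assumption: danda, question mark, exclamation mark


@dataclass
class SentenceFeatures:
    length: int                 # number of words, excluding the sentence ender
    verb_phrases: int           # number of verb phrases (computed upstream)
    has_dependent_clause: bool  # dependent clause check (computed upstream)


def sentence_length(sentence: str) -> int:
    """Count whitespace-separated words, ignoring the sentence ender."""
    words = (token.strip(SENTENCE_ENDERS) for token in sentence.split())
    return sum(1 for word in words if word)


def classify(features: SentenceFeatures) -> str:
    """Apply Steps 2 to 5 and return 'simple', 'compound' or 'complex'."""
    # Steps 2 and 3: a short sentence with exactly one verb phrase is simple.
    if features.length <= MAX_SIMPLE_LENGTH and features.verb_phrases == 1:
        return "simple"
    # Step 4: any remaining sentence with a dependent clause is complex.
    if features.has_dependent_clause:
        return "complex"
    # Step 5: one verb phrase means simple, otherwise compound.
    if features.verb_phrases == 1:
        return "simple"
    return "compound"


print(classify(SentenceFeatures(length=5, verb_phrases=1, has_dependent_clause=False)))
print(classify(SentenceFeatures(length=10, verb_phrases=2, has_dependent_clause=True)))
print(classify(SentenceFeatures(length=10, verb_phrases=2, has_dependent_clause=False)))
```

Keeping feature extraction separate from the decision logic means the length threshold and the order of the checks can be tested without a Punjabi morphological analyzer.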
[Flow chart: Input Punjabi Sentence; Calculate Length of Sentence; decision "Length < 7"; decision "Sentence contains one verb phrase"; decision "Check for presence of dependent clause"; decision "Contains more than one verb phrase"; outcomes: Simple Sentence, Compound Sentence, Complex Sentence.]

Figure 1: Proposed flow chart of sentence classification


Various challenges in developing the sentence classification system are:
• Because Punjabi phrase structure is recursive in nature, it is difficult to identify dependent clause boundaries
[18][19][20].
• There are various types of complex sentence in Punjabi [16][17][24], and these vary in structure; hence a separate
morphological feature has to be used to identify each type of complex sentence.

RESULTS AND DISCUSSION
As discussed above, the two main parameters used for classifying Punjabi sentences are length and morphological features. To
test the proposed algorithm, 15000 sentences were collected from online sources. Out of these 15000 sentences, 2000 simple
sentences and 1500 each of compound and complex sentences were used to check the effect of length on classification. The
results obtained are shown in Table 2. The classification results obtained after applying the complete proposed algorithm are
shown in Tables 3.a, 3.b, 3.c and 3.d.

Table 2: Results obtained by classifying the sentences on the basis of their length

| Type of sentence | Total number of sentences | Number of sentences having length <7 | %age of sentences having length <7 | Number of sentences having length >7 | %age of sentences having length >7 |
|---|---|---|---|---|---|
| Simple | 2000 | 1991 | 99.55 | 9 | 0.45 |
| Compound | 1500 | 0 | 0 | 1500 | 100 |
| Complex | 1500 | 5 | 0.4 | 1495 | 99.6 |

Table 3.a: Results obtained by applying the proposed algorithm on three datasets

| Test set number | Number of sentences in the set | Number of simple sentences in the corpus | Number of compound sentences in the corpus | Number of complex sentences in the corpus | Simple sentences correctly classified by proposed system | Compound sentences correctly classified by proposed system | Complex sentences correctly classified by proposed system |
|---|---|---|---|---|---|---|---|
| 1 | 5000 | 904 | 2890 | 1206 | 899 | 2855 | 1205 |
| 2 | 5000 | 658 | 1700 | 2642 | 655 | 1689 | 2642 |
| 3 | 5000 | 552 | 1890 | 2558 | 548 | 1881 | 2555 |
| Total | 15000 | 2114 | 6480 | 6406 | 2102 | 6425 | 6402 |
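
As a worked check of how per-class recall follows from the counts in Table 3.a, the short sketch below divides the correctly classified sentences by the number of sentences of that class, using the "Total" row. The names `TOTALS` and `recall` are illustrative assumptions. Precision is not recomputed, because the table does not list how many sentences were wrongly assigned to each class.

```python
# (sentences of this class in the corpus, correctly classified by the system),
# copied from the "Total" row of Table 3.a.
TOTALS = {
    "simple": (2114, 2102),
    "compound": (6480, 6425),
    "complex": (6406, 6402),
}


def recall(total: int, correct: int) -> float:
    """Fraction of the sentences of a class that the system labeled correctly."""
    return correct / total


for label, (total, correct) in TOTALS.items():
    print(f"{label}: recall = {100 * recall(total, correct):.2f}%")
```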
