Abstract
Conference Title: 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP)
Conference Start Date: 2018, April 25
Conference End Date: 2018, April 26
Conference Location: Algiers, Algeria

Plagiarism detection is a challenging Natural Language Processing (NLP) task. Many recent systems can detect simple verbatim reproduction (copy and paste). However, real plagiarism cases often involve more disguised techniques, such as rewording, synonym substitution, paraphrasing, and text manipulation, which make the detection task much more difficult. In this paper, we propose two approaches to assist users in detecting plagiarism in Arabic natural language texts. The first approach is based on word embedding, word alignment, and word weighting, and is used to measure the semantic similarity between textual units. The second approach is based on Machine Learning (ML), where the characterisation is performed at the sentence level. We combine lexical, syntactic, and semantic features to assist the detection task. Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF) classifiers are investigated. The classifiers are trained and evaluated using the training dataset of the first Arabic Plagiarism Detection (AraPlagDet) shared task 2015. Our experimental results show that the proposed approaches achieve promising results compared to state-of-the-art Arabic plagiarism detection systems.
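
To make the first approach more concrete, the following is a minimal sketch of how word embedding, word alignment, and word weighting could be combined into a similarity score between two textual units. It is an illustration under stated assumptions, not the system described in the paper: the abstract does not specify the alignment or weighting scheme, so the sketch assumes a greedy best-match alignment by cosine similarity and IDF-like weights. The function names, the toy vectors, the weights, and the English placeholder tokens (standing in for Arabic words) are all hypothetical.

```python
import math

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
# A real system would load pre-trained Arabic word embeddings instead.
VECTORS = {
    "car": [1.0, 0.1, 0.0],
    "automobile": [0.9, 0.2, 0.0],
    "fast": [0.1, 1.0, 0.0],
    "quick": [0.2, 0.9, 0.0],
    "banana": [0.0, 0.0, 1.0],
}

# Hypothetical IDF-like weights: rarer, more informative words count more.
WEIGHTS = {"car": 2.0, "automobile": 2.0, "fast": 1.0, "quick": 1.0, "banana": 2.0}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aligned_similarity(src, tgt, vectors, weights):
    """Align each source word to its best-matching target word and
    return the weighted average of those best-match similarities."""
    total, norm = 0.0, 0.0
    for word in src:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        best = max(
            (cosine(vectors[word], vectors[t]) for t in tgt if t in vectors),
            default=0.0,
        )
        weight = weights.get(word, 1.0)
        total += weight * best
        norm += weight
    return total / norm if norm else 0.0


def sentence_similarity(a, b, vectors=VECTORS, weights=WEIGHTS):
    """Symmetric score: average of the two directional alignments."""
    return 0.5 * (
        aligned_similarity(a, b, vectors, weights)
        + aligned_similarity(b, a, vectors, weights)
    )


if __name__ == "__main__":
    suspicious = ["car", "fast"]
    paraphrase = ["automobile", "quick"]
    unrelated = ["banana"]
    print(sentence_similarity(suspicious, paraphrase))
    print(sentence_similarity(suspicious, unrelated))
```

In this sketch, a pair whose words are replaced by close synonyms still scores high because the alignment is done in embedding space rather than on surface forms, which is the property that matters for disguised plagiarism such as synonym substitution. A decision threshold on the score would have to be tuned on labelled data.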