Enhancing recurrent neural network-based language models by word tokenization

Hatem M. Noaman; Shahenda S. Sarhan; Mohsen. A. A. Rashwan

doi:10.1186/s13673-018-0133-x

Journal article

Open access

Peer reviewed

Human-centric computing and information sciences, Vol.8(1), pp.1-13

27/04/2018

DOI: https://doi.org/10.1186/s13673-018-0133-x

Subjects: Computer Science; Computer Science, Information Systems; Science & Technology; Technology

Abstract

Different approaches have been used to estimate language models from a given corpus. Recently, researchers have used various neural network architectures for this task, exploiting the unsupervised learning capabilities of neural networks. In general, neural network language models have outperformed conventional n-gram language models. For languages with a rich morphological system and a very large vocabulary, however, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on the tokenization of words into three parts: the prefix, the stem, and the suffix. The proposed model is tested on the English AMI speech recognition dataset and outperforms the baseline n-gram model, the basic recurrent neural network language model (RNNLM), and the GPU-based recurrent neural network language model (CUED-RNNLM) in both perplexity and word error rate. On an Arabic misspelling dataset, automatic spelling correction accuracy improved by approximately 3.5%.
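
The three-part tokenization named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the paper's actual tokenizer, so everything here is an assumption made for illustration: the affix lists, the minimum stem length, the longest-match stripping rule, and the "+" markers on affix tokens are all hypothetical. The sketch only shows how splitting words into prefix, stem, and suffix lets a model work with a smaller inventory of units than full word forms.

```python
# Hypothetical affix inventories, chosen for illustration only;
# they are NOT the lists used in the paper.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ing", "ed", "ly", "s")
MIN_STEM = 3  # assumed lower bound on stem length to avoid over-stripping


def split_word(word):
    """Split a word into (prefix, stem, suffix) by longest-match affix stripping."""
    prefix = suffix = ""
    stem = word
    # Try longer prefixes first; strip at most one.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(p) and len(stem) - len(p) >= MIN_STEM:
            prefix, stem = p, stem[len(p):]
            break
    # Try longer suffixes first; strip at most one.
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(s) and len(stem) - len(s) >= MIN_STEM:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix


def tokenize(sentence):
    """Replace each word with its non-empty parts, marking affixes with '+'."""
    tokens = []
    for word in sentence.lower().split():
        prefix, stem, suffix = split_word(word)
        if prefix:
            tokens.append(prefix + "+")
        tokens.append(stem)
        if suffix:
            tokens.append("+" + suffix)
    return tokens
```

Under these assumed lists, `tokenize("Unlocked doors")` yields the units `un+`, `lock`, `+ed`, `door`, `+s`, so related forms such as "locked" and "unlocked" share the stem `lock` instead of occupying separate vocabulary entries.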

Files and links (1)

url

https://doi.org/10.1186/s13673-018-0133-x

Published (Version of record) Open

Details

Title: Enhancing recurrent neural network-based language models by word tokenization
Creators: Hatem M. Noaman (Beni-Suef University); Shahenda S. Sarhan (Mansoura University); Mohsen. A. A. Rashwan (Cairo University)
Publication Details: Human-centric computing and information sciences, Vol.8(1), pp.1-13
Publisher: Korea Information Processing Soc
Number of pages: 13
Identifiers: 9934131008331
Academic Unit: King Abdulaziz University
Language: English
Resource Type: Journal article