Text encoding for deep learning neural networks: A reversible base 64 (Tetrasexagesimal) Integer Transformation (RIT64) alternative to one hot encoding with applications to Arabic morphology

Mohammed A El Affendi; K H S Al Rajhi

Conference proceeding

The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings, p.70

01/01/2018

Abstract

Coding

Format

Machine learning

Neural networks

Neurons

Training

Transformations

Wireless communications

Wireless networks

Conference Title: 2018 Sixth International Conference on Digital Information, Networking, and Wireless Communications (DINWC)
Conference Start Date: 2018, April 25
Conference End Date: 2018, April 27
Conference Location: Beirut, Lebanon

One Hot Encoding (OHE) is currently the norm in text encoding for deep learning neural models. The main problem with OHE is that the size of the input vector, and hence the number of neurons in the input layer, depends on the size of the vocabulary. Experience has shown that the training time for text classification neural models grows exponentially with the size of the vocabulary when OHE is used. For example, if the size of the vocabulary is 10,000, then the size of the input vector will be 10,000, implying 10,000 neurons in the input layer. This paper proposes and illustrates the use of an alternative Reversible Integer Transformation (RIT), whereby each word in the training/testing set is transformed into base-64 integer format. The transformation is reversible, and the output of the network can easily be converted back to string format (without the need for an index). Another important feature is that each character in the word is represented using only six bits at the appropriate position in the resulting base-64 integer. The maximum number of neurons needed in the input layer is 64, but the actual number of neurons depends on the maximum word length in the vocabulary, and is usually below 64.
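
The six-bits-per-character idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the symbol table SYMBOLS, the choice to reserve code 0 for "no character" (so that leading symbols survive decoding), and the function names rit64_encode, rit64_decode and to_bits are all assumptions made for illustration. A Latin alphabet is used here for readability; the paper targets Arabic, whose symbol table is not given in this record.

```python
# Hypothetical 63-symbol table (assumption; the paper's actual table is not shown here).
SYMBOLS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
# Codes run from 1 to 63; code 0 is reserved to mean "no character",
# so a word never starts with a zero digit and decoding is unambiguous.
CODE = {ch: i + 1 for i, ch in enumerate(SYMBOLS)}


def rit64_encode(word):
    """Pack each character into six bits of a single base-64 integer."""
    n = 0
    for ch in word:
        n = (n << 6) | CODE[ch]  # shift left one base-64 digit, append the new one
    return n


def rit64_decode(n):
    """Recover the word from the integer, with no vocabulary index needed."""
    chars = []
    while n:
        chars.append(SYMBOLS[(n & 0x3F) - 1])  # lowest six bits = last character
        n >>= 6
    return "".join(reversed(chars))


def to_bits(n, max_len):
    """Expand the integer into a fixed-width bit vector (6 bits per character).

    Assumption: one input neuron per bit; shorter words are left-padded with zeros.
    """
    width = 6 * max_len
    return [(n >> (width - 1 - i)) & 1 for i in range(width)]


if __name__ == "__main__":
    code = rit64_encode("kitab")
    print(code, rit64_decode(code))
    print(len(to_bits(code, 10)))  # input width for a maximum word length of 10
```

Under these assumptions the input width is six times the maximum word length, independent of vocabulary size, whereas a one-hot vector needs one position per vocabulary entry.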

Details

Title: Text encoding for deep learning neural networks: A reversible base 64 (Tetrasexagesimal) Integer Transformation (RIT64) alternative to one hot encoding with applications to Arabic morphology
Creators - without role: Mohammed A El Affendi
K H S Al Rajhi
Publication Details: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings, p.70
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Identifiers: 9926648708331
Academic Unit: Prince Sultan University
Language: English
Resource Type: Conference proceeding