Abstract
Humans can draw on visual perception to generate sentences, bridging the gap between recognizing visual features (images) and the linguistic expression (words) that describes them. Video is one such form of visual input: a person who understands the content of a video can describe it in a meaningful sentence that serves as its caption. Automating video captioning, however, is challenging because it confronts the model with two problems: object detection and sentence generation. This research aims to develop an Encoder-Decoder deep learning model that automates video captioning in two steps. First, the KATNA model selects the most significant frames from the video and removes redundant ones. Second, two deep learning algorithms are combined: You Only Look Once (YOLO) recognizes objects in the video frames, and Long Short-Term Memory (LSTM) generates the video caption. The proposed model describes the video's content in a meaningful sentence and shows good accuracy and efficiency. Unlike other video captioning approaches, which use other deep learning techniques, it applies YOLO to the MSVD dataset.
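
The first step, dropping redundant frames before detection and caption generation, can be illustrated with a minimal sketch. This is not KATNA's algorithm or API. It assumes each frame is already flattened to a list of grayscale intensities in [0, 1], and it keeps a frame only when its mean absolute difference from the last kept frame exceeds a threshold. The names `mean_abs_diff` and `select_key_frames` and the default `threshold` value are hypothetical, chosen for illustration only.

```python
from typing import List, Sequence

# Assumption: a frame is a flat sequence of grayscale intensities in [0, 1].
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(frames: List[Frame], threshold: float = 0.1) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    Illustrative stand-in for key-frame selection, not KATNA's method:
    the first frame is always kept, and later frames are kept only when
    they differ from the most recently kept frame by more than `threshold`.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    video = [
        [0.0, 0.0, 0.0, 0.0],  # dark scene
        [0.0, 0.0, 0.0, 0.1],  # nearly identical, redundant
        [1.0, 1.0, 1.0, 1.0],  # scene change
        [1.0, 1.0, 1.0, 0.9],  # nearly identical, redundant
        [0.0, 0.0, 0.0, 0.0],  # scene change back
    ]
    print(select_key_frames(video))
```

In a pipeline like the one described, only the frames at the returned indices would be passed on to the object detector and then to the caption decoder, which reduces the work spent on near-duplicate frames.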