Abstract
Image captioning is the task of coupling a visual comprehension system with a language model to generate meaningful, syntactically correct sentences describing an image. The goal is to train a deep learning model to learn the correspondence between an image and its textual description. This is challenging due to the inherent complexity and subjectivity of language, as well as the visual variability of images, and it draws on both computer vision and natural language processing. In this paper, an end-to-end deep learning-based image captioning system using Inception V3 and Long Short-Term Memory (LSTM) with an attention mechanism is implemented. Extensive experiments were conducted on the benchmark MS COCO dataset, and the results show that the proposed system surpasses several related systems on the widely used evaluation metrics, achieving scores of 0.543 for METEOR and 0.87, 0.66, 0.51, 0.42 for BLEU-1 through BLEU-4, respectively.
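The encoder-decoder pipeline named in the abstract (Inception V3 features, additive attention, LSTM decoder) can be sketched as follows. This is a minimal illustration in TensorFlow/Keras, not the authors' implementation; the hyperparameters (embedding_dim=256, units=512, vocab_size) and class names are assumptions chosen for illustration only.

```python
# Minimal sketch of an Inception V3 + attention + LSTM captioner.
# NOT the paper's code; hyperparameters and names are illustrative.
import tensorflow as tf

# Encoder: Inception V3 pretrained on ImageNet with the classification
# head removed; the final 8x8x2048 feature map is flattened into a grid
# of 64 annotation vectors for the attention mechanism.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
encoder = tf.keras.Model(base.input, base.output)  # (batch, 8, 8, 2048)

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over the encoder's spatial features."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, units)
        hidden_t = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(score, axis=1)          # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, 1)  # (batch, 2048)
        return context, weights

class Decoder(tf.keras.Model):
    """LSTM decoder emitting one word per step, conditioned on the
    attention context computed from the previous hidden state."""
    def __init__(self, vocab_size, embedding_dim=256, units=512):
        super().__init__()
        self.attention = BahdanauAttention(units)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden, cell):
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                      # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        out, hidden, cell = self.lstm(x, initial_state=[hidden, cell])
        return self.fc(out), hidden, cell, weights       # logits over vocab

# One decoding step on dummy data, to show the expected shapes.
images = tf.random.uniform((2, 299, 299, 3))
feats = tf.reshape(encoder(images), (2, -1, 2048))       # (2, 64, 2048)
decoder = Decoder(vocab_size=5000)
h = c = tf.zeros((2, 512))
logits, h, c, w = decoder(tf.constant([[1], [1]]), feats, h, c)
```

At inference time, a step like the one above would be repeated, feeding the previous prediction (or a beam of candidates) back in as `word_ids` until an end-of-sentence token is produced.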
Keywords
Attention Mechanism
Image Caption Generation
Inception V3
LSTM