Long short-term memory network graphical representation as depicted in the proposed model. Three gates (input gate, output gate and forget gate) control the memory cell state c

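To make the gating in the figure concrete, here is a minimal NumPy sketch of a single LSTM step; the weight names and toy dimensions are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps gate names to weights."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])   # candidate state
    c = f * c_prev + i * g        # the three gates control the cell state c
    h = o * np.tanh(c)            # hidden state passed to the next step
    return h, c

# Toy dimensions: 4-dim input, 3-dim hidden state.
rng = np.random.default_rng(0)
shapes = {"W": (3, 4), "U": (3, 3), "b": (3,)}
p = {name + gate: rng.normal(size=shape)
     for gate in "ifoc" for name, shape in shapes.items()}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), p)
```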

Source publication
Article
Full-text available
The automatic narration of a natural scene is an important task in artificial intelligence that unites computer vision and natural language processing. Caption generation is a challenging task in scene understanding. Most state-of-the-art methods use deep convolutional neural network models to extract visual features of the entire ima...

Similar publications

Article
Full-text available
Understanding and identifying emotional cues in human speech is a crucial aspect of human-computer communication. The application of computer technology in dissecting and deciphering emotions, along with the extraction of relevant emotional characteristics from speech, forms a significant part of this process. The objective of this study was to arc...

Citations

... The convolutional layer lies at the heart of the CNN architecture and plays a critical role, as illustrated in Fig. 8. LSTM emerged as a network model to tackle the persistent issues of gradient explosion and gradient vanishing that plagued RNNs [24]. With its inherent memory and capacity for accurate predictions, it has been widely adopted in applications such as speech recognition, sentiment analysis, and text analysis [25]. Moreover, it has gained recent popularity in the realm of stock market forecasting [26]. ...
... Feature extraction is a pivotal stage in numerous machine learning applications, as it generates valuable insights for prediction models, enhancing their accuracy [25], [29]. Time-series problems are no exception in this context. ...
Article
Full-text available
Cardiovascular disorders are among the primary causes of death. Regularly monitoring the heart is of paramount importance in preventing fatalities arising from heart diseases. Heart disease monitoring encompasses various approaches, including the analysis of heartbeat sounds. The auditory patterns of a heartbeat can serve as indicators of heart health. This study aims to build a new model for categorizing heartbeat sounds based on associated ailments. The Phonocardiogram (PCG) method digitizes and records heartbeat sounds. By converting heartbeat sounds into digital data, researchers are empowered to develop a deep learning model capable of discerning heart defects based on distinct cardiac rhythms. This study proposes the utilization of Mel-frequency cepstral coefficients for feature extraction, leveraging their application in voice data analysis. These extracted features are subsequently employed in a multi-step classification process. The classification process merges a convolutional neural network (CNN) with a long short-term memory network (LSTM), forming a comprehensive deep learning architecture. This architecture is further enhanced through optimization utilizing the Adagrad optimizer. To examine the effectiveness of the proposed method, its classification performance is evaluated using the "Heartbeat Sounds" dataset sourced from Kaggle. Experimental results underscore the effectiveness of the proposed method by comparing it with simple CNN, CNN with vanilla LSTM, and traditional machine learning methods (MLP, SVM, Random Forest, and k-NN).
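For illustration, a minimal Keras sketch of the kind of pipeline this abstract describes (MFCC features feeding a CNN-LSTM classifier compiled with Adagrad); the layer sizes, class count, learning rate, and audio path are assumptions, not the authors' exact configuration:

```python
import librosa
import numpy as np
from tensorflow import keras

# MFCC features from one recording; the file path is a placeholder.
signal, sr = librosa.load("heartbeat.wav", sr=22050)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)   # (40, frames)
x = mfcc.T[None, ...]                                     # (1, frames, 40)

# CNN front end for local spectral patterns, LSTM for rhythm over time.
model = keras.Sequential([
    keras.layers.Input(shape=(None, 40)),
    keras.layers.Conv1D(64, 3, activation="relu"),
    keras.layers.MaxPooling1D(2),
    keras.layers.LSTM(64),
    keras.layers.Dense(5, activation="softmax"),          # assumed 5 classes
])
model.compile(optimizer=keras.optimizers.Adagrad(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```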
... To obtain more detailed information from the image, Anderson et al. [6] developed an approach that encodes the image using a pretrained object detector to detect a set of objects in the image. The success of this method [23] has made it a standard approach in recent vision-language research. Neural network research optimizes not only the models themselves but also the training methods [24,25]. ...
Article
Full-text available
The objective of image captioning is to provide precise descriptions of depicted objects and their relationships. To perform this task, previous studies have mainly relied on region features or a combination of these features and geometric coordinates. However, a significant limitation of these methods is their failure to incorporate grid features and their geometric coordinates, resulting in captions that inadequately identify object-related information within the global context. To overcome this limitation, we employ Swin Transformer and Deformable DETR to extract new grid and region features, along with their respective coordinates. Subsequently, we integrate the geometric coordinates of grids and regions into their corresponding features and incorporate grid features into the region features. The previously obtained features in the encoder are then used to generate text in the decoder. Through quantitative and qualitative analysis of the experimental results, our novel features and caption model have demonstrated superiority over previous methods. Specifically, our approach achieves superior inference accuracy on the COCO and Nocaps image captioning benchmarks. Compared to the baseline method, our model exhibits a 4.3% improvement, reaching a score of 136.9 on the CIDEr evaluation metric.
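A hedged PyTorch sketch of the general idea of folding geometric coordinates into region or grid features; the linear embedding and dimensions are assumptions, not the paper's actual fusion module:

```python
import torch
import torch.nn as nn

class GeometryFusion(nn.Module):
    """Fold box coordinates into visual features (idea sketch only)."""
    def __init__(self, dim=512):
        super().__init__()
        self.coord_embed = nn.Linear(4, dim)   # (x1, y1, x2, y2) -> feature space

    def forward(self, feats, boxes):           # feats: (N, dim), boxes: (N, 4)
        return feats + self.coord_embed(boxes) # geometry-aware features

feats, boxes = torch.rand(10, 512), torch.rand(10, 4)
fused = GeometryFusion()(feats, boxes)         # (10, 512)
```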
... It works well in the area of solving time-series problems. Since its introduction in 1997, it has been applied to problems such as handwritten digit recognition, text analysis, gesture recognition, and speech recognition, and it can make relatively accurate forecasts [9]. ...
Article
The analysis of stock price fluctuations holds considerable significance in the field of economics, particularly given the present environment characterized by unpredictability and rapid changes. The long short-term memory (LSTM) model has previously been employed effectively in addressing time series problems, including stock market forecasting. However, in the current dynamic landscape, the ability of LSTM to adapt to volatile conditions and provide accurate predictions is an area that merits further investigation. This study gathers stock data from prominent and representative companies, namely Apple, Google, Amazon, and Microsoft, spanning from January 2012 to March 2023. Specifically, two significant events are examined: the impact of the Covid-19 outbreak on the US stock market on February 26, 2020, and the Russia-Ukraine conflict occurring on February 26, 2022. By dividing the stock data surrounding these events into training and test sets, this research aims to evaluate the differential performance of LSTM in scenarios where it has no prior knowledge of these events versus situations where it has already assimilated the influence they exerted.
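A minimal sketch of the standard sliding-window framing such studies use for LSTM forecasting, on a synthetic series; the window length, layer sizes, and split are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras

def make_windows(prices, window=60):
    """Frame a 1-D price series as (samples, window, 1) -> next-day target."""
    X = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
    return X[..., None], prices[window:]

prices = np.cumsum(np.random.default_rng(1).normal(size=500)) + 100.0  # toy series
X, y = make_windows(prices)
split = int(0.8 * len(X))   # train on the past, test on the later period

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1], 1)),
    keras.layers.LSTM(50),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:split], y[:split], epochs=5, verbose=0)
test_mse = model.evaluate(X[split:], y[split:], verbose=0)
```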
... Similarly, LSTM is a network model designed to solve the longstanding problems of gradient explosion and gradient disappearance in RNN (Zarrad et al., 2019). It has been widely used in speech recognition, emotional analysis, and text analysis, as it has its own memory and can make relatively accurate forecasting (Gupta & Jalal, 2019). In recent years, it has also been adopted in the field of stock market forecasting (Yadav et al., 2020). There is only one repeating module in a standard RNN, and its internal structure is simple. ...
Article
The significant growth in the use of the Internet and the rapid development of network technologies are associated with an increased risk of network attacks. As the use of encryption protocols increases, so does the challenge of identifying malware-encrypted traffic. Malware is a threat to people in the cyber world, as it steals personal information and harms computer systems. Network attacks refer to all types of unauthorized access to a network, including any attempts to damage and disrupt the network, and often lead to serious consequences. Researchers, developers, and information security specialists around the globe continuously work on strategies for detecting malware. Recently, deep learning has been successfully applied to network security assessments and intrusion detection systems (IDSs), with breakthroughs such as using Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) to classify malicious traffic. However, given the diverse nature of malware, it is difficult to extract features from it, and existing solutions demand considerable computing resources when datasets contain large numbers of samples; adopting existing image feature extractors likewise consumes more resources. This paper therefore addresses these problems by combining a one-dimensional convolutional neural network (CNN) and long short-term memory (LSTM) to adequately detect and classify malicious encrypted traffic. This work was conducted on the malware analysis benchmark dataset of API call sequences, which contains 42,797 malware and 1,079 goodware API call sequences. The experimental results show that our proposed system achieved 99.2% accuracy and outperformed all other state-of-the-art models.
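A minimal Keras sketch of a 1D CNN + LSTM classifier over API-call ID sequences, in the spirit of the abstract; the vocabulary size, sequence length, and layer sizes are assumptions:

```python
from tensorflow import keras

VOCAB = 342     # assumed number of distinct API calls
SEQ_LEN = 100   # assumed padding/truncation length

# Embed API-call IDs, let Conv1D pick up local call patterns,
# and let the LSTM capture longer-range order in the sequence.
model = keras.Sequential([
    keras.layers.Input(shape=(SEQ_LEN,)),
    keras.layers.Embedding(VOCAB, 64),
    keras.layers.Conv1D(128, 5, activation="relu"),
    keras.layers.MaxPooling1D(2),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),   # malware vs. goodware
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```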
... It was developed to address the shortcomings of traditional recurrent neural networks (RNN), which could not learn long-term relationships between data points. LSTM has a special mechanism called a "memory cell" that allows it to temporarily store data, enabling it to learn and remember long sequences [12,13]. These capabilities make LSTM well suited to tasks such as image captioning, where it can handle long-term dependencies and learn more complex patterns than a standard RNN. ...
Article
Full-text available
Nowadays, images are being used more extensively for communication purposes. A single image can convey a variety of stories, depending on the perspective and thoughts of each person who views it. To facilitate comprehension, including image captions is highly beneficial, especially for individuals with visual impairments who can read Braille or rely on audio descriptions. The purpose of this research is to create an automatic captioning system that is easy to understand and quick to generate, and that can be applied to other related systems. In this research, the transformer learning process is applied to image captioning instead of the convolutional neural network (CNN) and recurrent neural network (RNN) pipeline, which has limitations in processing long-sequence data and managing data complexity; the transformer handles these limitations well and more efficiently. Additionally, the image captioning system was trained on a dataset of 5,000 images from Instagram tagged with the hashtag "Phuket" (#Phuket). The researchers also wrote the captions themselves to use as a dataset for testing the image captioning system. The experiments showed that the transformer learning process can generate natural captions that are close to human language. The generated captions are evaluated using the Bilingual Evaluation Understudy (BLEU) score and the Metric for Evaluation of Translation with Explicit Ordering (METEOR) score, which measure the similarity between machine-generated text and human-written text. This allows a comparison of the resemblance between the researcher-written captions and the transformer-generated captions.
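For reference, computing the two metrics named in the abstract with NLTK on a made-up caption pair (the example sentences are invented, and METEOR requires the WordNet corpus to be downloaded):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs nltk.download("wordnet")

# Invented example captions, not from the paper's dataset.
reference = "a boat floating in the sea near a beach in phuket".split()
candidate = "a boat on the sea near a phuket beach".split()

smooth = SmoothingFunction().method1   # avoids zero scores on short captions
bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
meteor = meteor_score([reference], candidate)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```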
... In 2020, Gupta et al. [35] studied the possibility of more accurate scene captioning by incorporating the text that is already present in a picture. In this study, we propose a model that combines words found in an image with visual data gathered using cutting-edge methods, so that image captioning accuracy is enhanced. ...
Article
Full-text available
Image captioning is a crucial area of artificial intelligence. It remained a very difficult task until the advancement of deep learning, and many open challenges remain in robustness, generalization, and accuracy, with results still far from satisfactory. As image captioning schemes are data-hungry, pre-training on larger-scale datasets, even if not well curated, is becoming a solid approach. In addition to precisely identifying the scene, objects, relationships, and attributes of the items in the image, an image caption generation method should produce natural, fluent, precise, and useful sentences. However, since not all visual information can be utilized, it can be difficult to effectively convey the image's content when writing image captions. Here, image captioning is performed with two models: the Neural Image Caption (NIC) model and an LSTM-based model. First, the NIC process is carried out, where CNN-based caption generation is performed for unlabeled and labeled datasets. Further, features, namely improved BOW and N-gram, are derived and used for training the CNN model. The final caption is generated by an optimized LSTM, whose weights are tuned by Harris Hawks with Sinusoidal Chaotic Map Assisted Exploitation (HH-SCME). Finally, BLEU, ROUGE, and CIDEr scores are computed to demonstrate the efficiency of HH-SCME. The proposed LSTM+HH-SCME model achieves a BLEU-1 score of 0.84, compared with existing methods such as CNN, SSO, PRO, AOA, RNN, and LSTM.
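The abstract's "improved BOW" variant is not specified, but standard BOW and N-gram feature extraction, the baseline it builds on, can be sketched with scikit-learn (the captions are toy examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

captions = ["a dog runs on the grass", "a man rides a horse"]   # toy captions

bow = CountVectorizer(ngram_range=(1, 1)).fit_transform(captions)     # unigram BOW
ngrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(captions)  # bigram features
print(bow.shape, ngrams.shape)   # document-term matrices fed to the classifier
```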
... An image caption generator employs computer vision and natural language processing methods to comprehend the context of a picture and describe it in a language like English [1]. The goal of this research is to introduce readers to the concepts behind a CNN and LSTM model and show them how to utilize them to build an image caption generator [2]. ...
Article
Full-text available
Can a machine interpret an image's meaning as quickly as the human brain does when it sees one? This problem was heavily researched by computer vision specialists, who believed it to be unsolvable until recently. It is now possible to develop models that can generate captions for pictures because of advancements in deep learning techniques, accessibility of large datasets, and processing power. This is accomplished by a Python-based implementation of the article's deep learning technique: a convolutional neural network combined with a particular kind of recurrent neural network. The proposed model uses CNN and LSTM methods to achieve the desired task.
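A minimal Keras sketch of one common CNN + LSTM "merge" arrangement for caption generation, consistent with, but not necessarily identical to, the article's model; all sizes are assumptions:

```python
from tensorflow import keras

VOCAB, MAXLEN, FEAT = 5000, 34, 2048   # assumed vocab, caption length, CNN feature size

image_in = keras.Input(shape=(FEAT,))            # CNN features, extracted offline
img = keras.layers.Dense(256, activation="relu")(image_in)

words_in = keras.Input(shape=(MAXLEN,))          # caption-so-far as word IDs
seq = keras.layers.Embedding(VOCAB, 256, mask_zero=True)(words_in)
seq = keras.layers.LSTM(256)(seq)

merged = keras.layers.add([img, seq])            # classic "merge" architecture
next_word = keras.layers.Dense(VOCAB, activation="softmax")(merged)

model = keras.Model([image_in, words_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

At inference time the model is called repeatedly, feeding each predicted word back into the caption-so-far until an end token is produced.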
... Also, they proposed variants of LSTM on the mentioned dataset and concluded that Bag-LSTM performs better on the CIDEr metric. In the paper [20], the authors proposed fusion-based text feature extraction for image captioning using a DNN (deep neural network) with LSTM. They evaluated the proposed model on the Flickr30k dataset. ...
Article
Full-text available
Image captioning is an interesting and challenging task with applications in diverse domains such as image retrieval, organizing and locating images of users' interest, etc. It has huge potential for replacing manual caption generation for images and is especially suitable for large-scale image data. Recently, deep neural network based methods have achieved great success in the fields of computer vision, machine translation, and language generation. In this paper, we propose an encoder-decoder based model that is capable of generating grammatically correct captions for images. This model makes use of VGG16 Hybrid Places 1365 as an encoder and LSTM as a decoder. To ensure complete ground truth accuracy, the model is trained on the labeled Flickr8k and MS-COCO Captions datasets. Further, the model is evaluated using all popular standard metrics such as BLEU, METEOR, GLEU, and ROUGE_L. Experimental results indicate that the proposed model obtained a BLEU-1 score of 0.6666, METEOR score of 0.5060, and GLEU score of 0.2469 on the Flickr8k dataset, and a BLEU-1 score of 0.7350, METEOR score of 0.4768, and GLEU score of 0.2798 on the MS-COCO Captions dataset. Thus, the proposed method achieved significant performance compared to state-of-the-art approaches. To evaluate the efficacy of the model further, we also show the results of caption generation from live sample images, which reinforce the validity of the proposed approach.
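A sketch of the encoder side using stock VGG16 as a stand-in (the paper's VGG16 Hybrid Places 1365 weights would be loaded analogously; the image path is a placeholder):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

base = VGG16(weights="imagenet")                            # stand-in weights
encoder = keras.Model(base.input, base.layers[-2].output)   # 4096-d fc2 features

img = keras.utils.load_img("example.jpg", target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(keras.utils.img_to_array(img), 0))
features = encoder.predict(x)   # (1, 4096) vector handed to the LSTM decoder
```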
... Scene graphs are employed to represent the detected objects and their relationships, and these structured representations are applied to generate description sentences [14,28]. Gupta et al. [29] showed that fusing the text available in an image can give more fine-grained captioning of a scene. In recent years, the Vision-Language Pre-training (VLP) [30,31] model has helped downstream tasks achieve excellent results by pre-training the alignment of vision and language on large-scale corpora. ...
Article
Full-text available
Automatically generating descriptions for disaster news images could effectively accelerate the spread of disaster messages and lighten the burden on news editors working through tedious news materials. Image caption algorithms are remarkable for generating captions directly from the content of the image. However, current image caption algorithms trained on existing image caption datasets fail to describe disaster images with the fundamental news elements. In this paper, we developed a large-scale disaster news image Chinese caption dataset (DNICC19k), which collects and annotates a large number of news images related to disasters. Furthermore, we proposed a spatial-aware topic driven caption network (STCNet) to encode the interrelationships between these news objects and generate descriptive sentences related to news topics. STCNet first constructs a graph representation based on object feature similarity. The graph reasoning module uses spatial information to infer the weights of aggregated adjacent nodes according to a learnable Gaussian kernel function. Finally, the generation of news sentences is driven by the spatial-aware graph representations and the news topic distribution. Experimental results demonstrate that STCNet trained on DNICC19k not only automatically creates descriptive sentences related to news topics for disaster news images, but also outperforms benchmark models such as Bottom-up, NIC, Show, Attend and Tell, and AoANet on multiple evaluation metrics, achieving CIDEr/BLEU-4 scores of 60.26 and 17.01, respectively.
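A hedged PyTorch sketch of the core idea of a learnable Gaussian kernel over spatial distances producing aggregation weights; this illustrates the mechanism only, not STCNet itself:

```python
import torch
import torch.nn as nn

class SpatialGaussianWeights(nn.Module):
    """Weight neighbors by a learnable Gaussian of box-center distance
    (a sketch of the mechanism, not the paper's module)."""
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(1))   # learnable bandwidth

    def forward(self, centers):                  # centers: (N, 2) box centers
        d2 = torch.cdist(centers, centers) ** 2  # pairwise squared distances
        w = torch.exp(-d2 / (2 * torch.exp(self.log_sigma) ** 2))
        return w / w.sum(dim=-1, keepdim=True)   # row-normalized aggregation weights

centers = torch.rand(6, 2)                       # six detected objects
weights = SpatialGaussianWeights()(centers)      # (6, 6) neighbor weights
```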
... Various state-of-the-art techniques and models have been published in previous years to generate human-like captions. Image captioning approaches [11], [14] and [17] are broadly classified into Template-based [18][19][20][21], Retrieval-based [22][23][24][25][26], and Encoder-decoder methods [27][28][29][30]. In paper [31], a content selection approach was proposed for image description using geometric, conceptual, and visual features of the image. ...