Long short-term memory network graphical representation as depicted in the proposed model. Three gates (input gate, output gate and forget gate) control the memory cell state c

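To make the gating in the figure concrete, here is a minimal NumPy sketch of a single LSTM step; the weight names and toy dimensions are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps gate names to weights."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])   # candidate state
    c = f * c_prev + i * g        # the three gates control the cell state c
    h = o * np.tanh(c)            # hidden state passed to the next step
    return h, c

# Toy dimensions: 4-dim input, 3-dim hidden state.
rng = np.random.default_rng(0)
shapes = {"W": (3, 4), "U": (3, 3), "b": (3,)}
p = {name + gate: rng.normal(size=shape)
     for gate in "ifoc" for name, shape in shapes.items()}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), p)
```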

Source publication
Article
Full-text available
The automatic narration of a natural scene is an important task in artificial intelligence that unites computer vision and natural language processing. Caption generation is a challenging task in scene understanding. Most state-of-the-art methods use deep convolutional neural network models to extract visual features of the entire ima...

Similar publications

Article
Full-text available
Understanding and identifying emotional cues in human speech is a crucial aspect of human-computer communication. The application of computer technology in dissecting and deciphering emotions, along with the extraction of relevant emotional characteristics from speech, forms a significant part of this process. The objective of this study was to arc...

Citations

... The convolutional layer lies at the heart of the CNN architecture and plays a critical role, as illustrated in Fig. 8. LSTM emerged as a network model to tackle the persistent issues of gradient explosion and gradient vanishing that plagued RNNs [24]. With its inherent memory and capacity for accurate predictions, it has been widely adopted in applications such as speech recognition, sentiment analysis, and text analysis [25]. Moreover, it has gained recent popularity in the realm of stock market forecasting [26]. ...
... Feature extraction is a pivotal stage in numerous machine learning applications, as it generates valuable insights for prediction models, enhancing their accuracy [25], [29]. Time-series problems are no exception in this context. ...
Article
Full-text available
Cardiovascular disorders are among the primary causes of death. Regularly monitoring the heart is of paramount importance in preventing fatalities arising from heart diseases. Heart disease monitoring encompasses various approaches, including the analysis of heartbeat sounds. The auditory patterns of a heartbeat can serve as indicators of heart health. This study aims to build a new model for categorizing heartbeat sounds based on associated ailments. The Phonocardiogram (PCG) method digitizes and records heartbeat sounds. By converting heartbeat sounds into digital data, researchers are empowered to develop a deep learning model capable of discerning heart defects based on distinct cardiac rhythms. This study proposes the utilization of Mel-frequency cepstral coefficients for feature extraction, leveraging their application in voice data analysis. These extracted features are subsequently employed in a multi-step classification process. The classification process merges a convolutional neural network (CNN) with a long short-term memory network (LSTM), forming a comprehensive deep learning architecture. This architecture is further enhanced through optimization utilizing the Adagrad optimizer. To examine the effectiveness of the proposed method, its classification performance is evaluated using the "Heartbeat Sounds" dataset sourced from Kaggle. Experimental results underscore the effectiveness of the proposed method by comparing it with simple CNN, CNN with vanilla LSTM, and traditional machine learning methods (MLP, SVM, Random Forest, and k-NN).
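For illustration, a minimal Keras sketch of the kind of pipeline this abstract describes (MFCC features feeding a CNN-LSTM classifier compiled with Adagrad); the layer sizes, class count, learning rate, and audio path are assumptions, not the authors' exact configuration:

```python
import librosa
import numpy as np
from tensorflow import keras

# MFCC features from one recording; the file path is a placeholder.
signal, sr = librosa.load("heartbeat.wav", sr=22050)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)   # (40, frames)
x = mfcc.T[None, ...]                                     # (1, frames, 40)

# CNN front end for local spectral patterns, LSTM for rhythm over time.
model = keras.Sequential([
    keras.layers.Input(shape=(None, 40)),
    keras.layers.Conv1D(64, 3, activation="relu"),
    keras.layers.MaxPooling1D(2),
    keras.layers.LSTM(64),
    keras.layers.Dense(5, activation="softmax"),          # assumed 5 classes
])
model.compile(optimizer=keras.optimizers.Adagrad(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```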
... To obtain more detailed information from the image, Anderson et al. [6] developed an approach that encodes the image using a pretrained object detector to detect a set of objects in the image. The success of this method [23] has made it a standard approach in recent vision-language research. Neural network research optimizes not only the models themselves but also the training methods [24,25]. ...
Article
Full-text available
The objective of image captioning is to provide precise descriptions of depicted objects and their relationships. To perform this task, previous studies have mainly relied on region features or a combination of these features and geometric coordinates. However, a significant limitation of these methods is their failure to incorporate grid features and their geometric coordinates, resulting in captions that inadequately identify object-related information within the global context. To overcome this limitation, we employ Swin Transformer and Deformable DETR to extract new grid and region features, along with their respective coordinates. Subsequently, we integrate the geometric coordinates of grids and regions into their corresponding features and incorporate grid features into the region features. The previously obtained features in the encoder are then used to generate text in the decoder. Through quantitative and qualitative analysis of the experimental results, our novel features and caption model have demonstrated superiority over previous methods. Specifically, our approach achieves superior inference accuracy on the COCO and Nocaps image captioning benchmarks. Compared to the baseline method, our model exhibits a 4.3% improvement, reaching a score of 136.9 on the CIDEr evaluation metric.
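A hedged PyTorch sketch of the general idea of folding geometric coordinates into region or grid features; the linear embedding and dimensions are assumptions, not the paper's actual fusion module:

```python
import torch
import torch.nn as nn

class GeometryFusion(nn.Module):
    """Fold box coordinates into visual features (idea sketch only)."""
    def __init__(self, dim=512):
        super().__init__()
        self.coord_embed = nn.Linear(4, dim)   # (x1, y1, x2, y2) -> feature space

    def forward(self, feats, boxes):           # feats: (N, dim), boxes: (N, 4)
        return feats + self.coord_embed(boxes) # geometry-aware features

feats, boxes = torch.rand(10, 512), torch.rand(10, 4)
fused = GeometryFusion()(feats, boxes)         # (10, 512)
```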
... It works well in the area of solving time-series problems. Since its introduction in 1997, it has been applied to problems such as handwritten digit recognition, text analysis, gesture recognition, and speech recognition, and it can make relatively accurate forecasts [9]. ...
Article
The analysis of stock price fluctuations holds considerable significance in the field of economics, particularly given the present environment characterized by unpredictability and rapid changes. The long short-term memory (LSTM) model has previously been employed effectively in addressing time series problems, including stock market forecasting. However, in the current dynamic landscape, the ability of LSTM to adapt to volatile conditions and provide accurate predictions is an area that merits further investigation. This study gathers stock data from prominent and representative companies, namely Apple, Google, Amazon, and Microsoft, spanning from January 2012 to March 2023. Specifically, two significant events are examined: the impact of the Covid-19 outbreak on the US stock market on February 26, 2020, and the Russia-Ukraine conflict occurring on February 26, 2022. By dividing the stock data surrounding these events into training and test sets, this research aims to evaluate the differential performance of LSTM in scenarios where it has no prior knowledge of these events versus situations where it has already assimilated the influence they exerted.
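A minimal sketch of the standard sliding-window framing such studies use for LSTM forecasting, on a synthetic series; the window length, layer sizes, and split are illustrative assumptions:

```python
import numpy as np
from tensorflow import keras

def make_windows(prices, window=60):
    """Frame a 1-D price series as (samples, window, 1) -> next-day target."""
    X = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
    return X[..., None], prices[window:]

prices = np.cumsum(np.random.default_rng(1).normal(size=500)) + 100.0  # toy series
X, y = make_windows(prices)
split = int(0.8 * len(X))   # train on the past, test on the later period

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1], 1)),
    keras.layers.LSTM(50),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[:split], y[:split], epochs=5, verbose=0)
test_mse = model.evaluate(X[split:], y[split:], verbose=0)
```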
... Similarly, LSTM is a network model designed to solve the longstanding problems of gradient explosion and gradient disappearance in RNN (Zarrad et al., 2019). It has been widely used in speech recognition, emotional analysis, and text analysis, as it has its own memory and can make relatively accurate forecasting (Gupta & Jalal, 2019). In recent years, it has also been adopted in the field of stock market forecasting (Yadav et al., 2020). There is only one repeating module in a standard RNN, and its internal structure is simple. ...
Article
The significant growth in the use of the Internet and the rapid development of network technologies are associated with an increased risk of network attacks. As the use of encryption protocols increases, so does the challenge of identifying malware-encrypted traffic. Malware is a threat to people in the cyber world, as it steals personal information and harms computer systems. Network attacks refer to all types of unauthorized access to a network, including any attempts to damage and disrupt the network, and often lead to serious consequences. Researchers, developers, and information security specialists around the globe continuously work on strategies for detecting malware. Recently, deep learning has been successfully applied to network security assessments and intrusion detection systems (IDSs), with breakthroughs such as using Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) to classify malicious traffic. However, given the diverse nature of malware, it is difficult to extract features from it, and existing solutions demand considerable computing resources when datasets contain large numbers of samples; adopting existing image feature extractors likewise consumes more resources. This paper therefore addresses these problems by combining a one-dimensional convolutional neural network (CNN) and long short-term memory (LSTM) to adequately detect and classify malicious encrypted traffic. This work was conducted on the malware analysis benchmark dataset of API call sequences, which contains 42,797 malware and 1,079 goodware API call sequences. The experimental results show that our proposed system achieved 99.2% accuracy and outperformed all other state-of-the-art models.
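A minimal Keras sketch of a 1D CNN + LSTM classifier over API-call ID sequences, in the spirit of the abstract; the vocabulary size, sequence length, and layer sizes are assumptions:

```python
from tensorflow import keras

VOCAB = 342     # assumed number of distinct API calls
SEQ_LEN = 100   # assumed padding/truncation length

# Embed API-call IDs, let Conv1D pick up local call patterns,
# and let the LSTM capture longer-range order in the sequence.
model = keras.Sequential([
    keras.layers.Input(shape=(SEQ_LEN,)),
    keras.layers.Embedding(VOCAB, 64),
    keras.layers.Conv1D(128, 5, activation="relu"),
    keras.layers.MaxPooling1D(2),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),   # malware vs. goodware
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```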
... It was developed to address the shortcomings of traditional recurrent neural networks (RNN), which could not learn long-term relationships between data points. LSTM has a special mechanism called a "memory cell" that allows it to temporarily store data, enabling it to learn and remember long sequences [12,13]. These capabilities make LSTM well suited to tasks such as image captioning, where it can handle long-term dependencies and learn more complex patterns than a standard RNN. ...
Article
Full-text available
Nowadays, images are being used more extensively for communication purposes. A single image can convey a variety of stories, depending on the perspective and thoughts of each person who views it. To facilitate comprehension, including image captions is highly beneficial, especially for individuals with visual impairments who can read Braille or rely on audio descriptions. The purpose of this research is to create an automatic captioning system that is easy to understand and quick to generate, and that can be applied to other related systems. In this research, the transformer learning process is applied to image captioning instead of the convolutional neural network (CNN) and recurrent neural network (RNN) pipeline, which has limitations in processing long-sequence data and managing data complexity; the transformer handles these limitations well and more efficiently. Additionally, the image captioning system was trained on a dataset of 5,000 images from Instagram tagged with the hashtag "Phuket" (#Phuket). The researchers also wrote the captions themselves to use as a dataset for testing the image captioning system. The experiments showed that the transformer learning process can generate natural captions that are close to human language. The generated captions are evaluated using the Bilingual Evaluation Understudy (BLEU) score and the Metric for Evaluation of Translation with Explicit Ordering (METEOR) score, which measure the similarity between machine-generated text and human-written text. This allows a comparison of the resemblance between the researcher-written captions and the transformer-generated captions.
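For reference, computing the two metrics named in the abstract with NLTK on a made-up caption pair (the example sentences are invented, and METEOR requires the WordNet corpus to be downloaded):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs nltk.download("wordnet")

# Invented example captions, not from the paper's dataset.
reference = "a boat floating in the sea near a beach in phuket".split()
candidate = "a boat on the sea near a phuket beach".split()

smooth = SmoothingFunction().method1   # avoids zero scores on short captions
bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
meteor = meteor_score([reference], candidate)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```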
... In 2020, Gupta et al. [35] studied the possibility of more accurate scene captioning by incorporating the text that is already present in a picture. In this study, we propose a model that combines words found in an image with visual data gathered using cutting-edge methods, so that image captioning accuracy is enhanced. ...
Article
Full-text available
Image captioning is a crucial area of artificial intelligence. It remained a very difficult task until the advancement of deep learning, and many open challenges remain in robustness, generalization, and accuracy, with results still far from satisfactory. As image captioning schemes are data-hungry, pre-training on larger-scale datasets, even if not well curated, is becoming a solid approach. In addition to precisely identifying the scene, objects, relationships, and attributes of the items in the image, an image caption generation method should produce natural, fluent, precise, and useful sentences. However, since not all visual information can be utilized, it can be difficult to effectively convey the image's content when writing image captions. Here, image captioning is performed with two models: the Neural Image Caption (NIC) model and an LSTM-based model. First, the NIC process is carried out, where CNN-based caption generation is performed for unlabeled and labeled datasets. Further, features, namely improved BOW and N-gram, are derived and used for training the CNN model. The final caption is generated by an optimized LSTM, whose weights are tuned by Harris Hawks with Sinusoidal Chaotic Map Assisted Exploitation (HH-SCME). Finally, BLEU, ROUGE, and CIDEr scores are computed to demonstrate the efficiency of HH-SCME. The proposed LSTM+HH-SCME model achieves a BLEU-1 score of 0.84, compared with existing methods such as CNN, SSO, PRO, AOA, RNN, and LSTM.
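The abstract's "improved BOW" variant is not specified, but standard BOW and N-gram feature extraction, the baseline it builds on, can be sketched with scikit-learn (the captions are toy examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

captions = ["a dog runs on the grass", "a man rides a horse"]   # toy captions

bow = CountVectorizer(ngram_range=(1, 1)).fit_transform(captions)     # unigram BOW
ngrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(captions)  # bigram features
print(bow.shape, ngrams.shape)   # document-term matrices fed to the classifier
```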
... An image caption generator employs computer vision and natural language processing methods to comprehend the context of a picture and describe it in a language like English [1]. The goal of this research is to introduce readers to the concepts behind a CNN and LSTM model and show them how to utilize them to build an image caption generator [2]. ...
Article
Full-text available
Can a machine interpret an image's meaning as quickly as the human brain does when it sees one? This problem was heavily researched by computer vision specialists, who believed it to be unsolvable until recently. It is now possible to develop models that can generate captions for pictures because of advancements in deep learning techniques, accessibility of large datasets, and processing power. This is accomplished by a Python-based implementation of the article's deep learning technique: a convolutional neural network combined with a particular kind of recurrent neural network. The proposed model uses CNN and LSTM methods to achieve the desired task.
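A minimal Keras sketch of one common CNN + LSTM "merge" arrangement for caption generation, consistent with, but not necessarily identical to, the article's model; all sizes are assumptions:

```python
from tensorflow import keras

VOCAB, MAXLEN, FEAT = 5000, 34, 2048   # assumed vocab, caption length, CNN feature size

image_in = keras.Input(shape=(FEAT,))            # CNN features, extracted offline
img = keras.layers.Dense(256, activation="relu")(image_in)

words_in = keras.Input(shape=(MAXLEN,))          # caption-so-far as word IDs
seq = keras.layers.Embedding(VOCAB, 256, mask_zero=True)(words_in)
seq = keras.layers.LSTM(256)(seq)

merged = keras.layers.add([img, seq])            # classic "merge" architecture
next_word = keras.layers.Dense(VOCAB, activation="softmax")(merged)

model = keras.Model([image_in, words_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

At inference time the model is called repeatedly, feeding each predicted word back into the caption-so-far until an end token is produced.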
... Also, they proposed variants of LSTM on the mentioned dataset and concluded that Bag-LSTM performs better on the CIDEr metric. In the paper [20], the authors proposed fusion-based text feature extraction for image captioning using a DNN (deep neural network) with LSTM. They evaluated the proposed model on the Flickr30k dataset. ...
Article
Full-text available
Image captioning is an interesting and challenging task with applications in diverse domains such as image retrieval, organizing and locating images of users' interest, etc. It has huge potential for replacing manual caption generation for images and is especially suitable for large-scale image data. Recently, deep neural network based methods have achieved great success in the fields of computer vision, machine translation, and language generation. In this paper, we propose an encoder-decoder based model that is capable of generating grammatically correct captions for images. This model makes use of VGG16 Hybrid Places 1365 as an encoder and LSTM as a decoder. To ensure complete ground truth accuracy, the model is trained on the labeled Flickr8k and MS-COCO Captions datasets. Further, the model is evaluated using all popular standard metrics such as BLEU, METEOR, GLEU, and ROUGE_L. Experimental results indicate that the proposed model obtained a BLEU-1 score of 0.6666, METEOR score of 0.5060, and GLEU score of 0.2469 on the Flickr8k dataset, and a BLEU-1 score of 0.7350, METEOR score of 0.4768, and GLEU score of 0.2798 on the MS-COCO Captions dataset. Thus, the proposed method achieved significant performance compared to state-of-the-art approaches. To evaluate the efficacy of the model further, we also show the results of caption generation from live sample images, which reinforce the validity of the proposed approach.
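A sketch of the encoder side using stock VGG16 as a stand-in (the paper's VGG16 Hybrid Places 1365 weights would be loaded analogously; the image path is a placeholder):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

base = VGG16(weights="imagenet")                            # stand-in weights
encoder = keras.Model(base.input, base.layers[-2].output)   # 4096-d fc2 features

img = keras.utils.load_img("example.jpg", target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(keras.utils.img_to_array(img), 0))
features = encoder.predict(x)   # (1, 4096) vector handed to the LSTM decoder
```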
... Scene graphs are employed to represent the detected objects and their relationships, and these structured representations are applied to generate description sentences [14,28]. Gupta et al. [29] showed that fusing the text available in an image can give more fine-grained captioning of a scene. In recent years, the Vision-Language Pre-training (VLP) [30,31] model has helped downstream tasks achieve excellent results by pre-training the alignment of vision and language on large-scale corpora. ...
Article
Full-text available
Automatically generating descriptions for disaster news images could effectively accelerate the spread of disaster messages and lighten the burden on news editors working through tedious news materials. Image caption algorithms are remarkable for generating captions directly from the content of the image. However, current image caption algorithms trained on existing image caption datasets fail to describe disaster images with the fundamental news elements. In this paper, we developed a large-scale disaster news image Chinese caption dataset (DNICC19k), which collects and annotates a large number of news images related to disasters. Furthermore, we proposed a spatial-aware topic driven caption network (STCNet) to encode the interrelationships between these news objects and generate descriptive sentences related to news topics. STCNet first constructs a graph representation based on object feature similarity. The graph reasoning module uses spatial information to infer the weights of aggregated adjacent nodes according to a learnable Gaussian kernel function. Finally, the generation of news sentences is driven by the spatial-aware graph representations and the news topic distribution. Experimental results demonstrate that STCNet trained on DNICC19k not only automatically creates descriptive sentences related to news topics for disaster news images, but also outperforms benchmark models such as Bottom-up, NIC, Show, Attend and Tell, and AoANet on multiple evaluation metrics, achieving CIDEr/BLEU-4 scores of 60.26 and 17.01, respectively.
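A hedged PyTorch sketch of the core idea of a learnable Gaussian kernel over spatial distances producing aggregation weights; this illustrates the mechanism only, not STCNet itself:

```python
import torch
import torch.nn as nn

class SpatialGaussianWeights(nn.Module):
    """Weight neighbors by a learnable Gaussian of box-center distance
    (a sketch of the mechanism, not the paper's module)."""
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(1))   # learnable bandwidth

    def forward(self, centers):                  # centers: (N, 2) box centers
        d2 = torch.cdist(centers, centers) ** 2  # pairwise squared distances
        w = torch.exp(-d2 / (2 * torch.exp(self.log_sigma) ** 2))
        return w / w.sum(dim=-1, keepdim=True)   # row-normalized aggregation weights

centers = torch.rand(6, 2)                       # six detected objects
weights = SpatialGaussianWeights()(centers)      # (6, 6) neighbor weights
```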
... Various state-of-the-art techniques and models have been published in previous years to generate human-like captions. Image captioning approaches [11], [14] and [17] are broadly classified into Template-based [18][19][20][21], Retrieval-based [22][23][24][25][26], and Encoder-decoder methods [27][28][29][30]. In paper [31], a content selection approach was proposed for image description using geometric, conceptual, and visual features of the image. ...