MULTIMODAL HYBRID DEEP LEARNING TECHNIQUES FOR FEATURE EXTRACTION FROM VIDEO FOR VIDEO CLASSIFICATION



Authors

  • Priyanka Panchal¹, Dr. Dinesh J. Prajapati²

DOI:

https://doi.org/10.15282/jmes.17.1.2023.10.0759


Keywords:

Convolutional neural networks, Deep learning, LSTM, Video classification, Keyframe extraction


Abstract

Videos carry very rich semantic content, and video classification has gained much attention as a result of the exponential growth in video data. Conventional handcrafted features are limited in their ability to analyze complex video semantics. Deep neural networks (DNNs) for video analytics have therefore developed rapidly, driven in part by the success of deep learning techniques on sequential data such as image frames, audio, and text. One of the most practical applications of DNNs in video analytics is categorizing videos into their main semantic labels, such as "skiing". A critical step in video classification is feature extraction, which aims to capture the essential information in a video on which the classification is based. The proposed approach combines the complementary strengths of Convolutional Neural Networks (CNNs) and Temporal Difference Networks (TDNs) to extract features from video frames and their temporal sequences. First, a pretrained CNN model is applied to the video for frame-level feature extraction; these features are then fed into a TDN to capture the temporal dynamics. The benchmark UCF101 dataset was used to evaluate the proposed method against several state-of-the-art methods. This paper thus presents a novel deep learning model for video classification that combines the strengths of two different deep learning techniques and can take advantage of a wide variety of multimodal information. The proposed method concatenates the spatial and temporal features extracted by the CNN and the TDN, respectively. The concatenated features are then passed to an LSTM classifier, whose output determines the video class.
The experimental results demonstrate the efficacy of the proposed method, which achieves higher classification accuracy and outperforms the compared methods in most cases.
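The feature-fusion step described in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the frame-level CNN features are assumed to be precomputed, the TDN branch is simplified to frame-to-frame feature differences, and all names and shapes are illustrative. The concatenated output is what would be fed to the LSTM classifier.

```python
import numpy as np

def build_clip_features(frame_features: np.ndarray) -> np.ndarray:
    """Combine spatial and temporal features for one video clip.

    frame_features: (T, D) array of per-frame features, assumed to come
    from a pretrained CNN applied to T sampled keyframes.
    Returns a (T-1, 2*D) array in which each row concatenates a frame's
    spatial features with the difference to the next frame's features --
    a simplified stand-in for the TDN temporal branch.
    """
    spatial = frame_features[:-1]               # (T-1, D) per-frame CNN features
    temporal = np.diff(frame_features, axis=0)  # (T-1, D) frame-to-frame differences
    # concatenated spatial+temporal sequence, to be passed to an LSTM classifier
    return np.concatenate([spatial, temporal], axis=1)

# toy example: 8 frames with 4-dimensional features
feats = np.random.randn(8, 4)
clip = build_clip_features(feats)
print(clip.shape)  # (7, 8)
```

In the paper's pipeline, `clip` would be a sequence input to the LSTM, whose final hidden state is mapped to one of the UCF101 action labels.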



Published

2024-06-14
