Article

Automatic detection of slide transitions in lecture videos

Abstract

This paper presents a method to automatically detect slide changes in lecture videos. For accurate detection, the regions capturing slide images are first identified in the video frames. SIFT features, which are invariant to image scaling and rotation, are then extracted from these regions and used to compare the similarity between frames. If the similarity falls below a threshold, a slide transition is detected. The threshold is estimated from the mean and standard deviation of the similarities of sample frames. With this method, high detection accuracy can be obtained without any supplementary slide images. The proposed method also supports detection of backward slide transitions, which occur when a speaker returns to a previous slide to emphasize its contents. In experiments on our test collection, the proposed method achieved 87% accuracy in forward transition detection and 86% accuracy in backward transition detection.
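The similarity test and threshold rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the ratio-test similarity measure and the factor `k` in the mean - k*std rule are assumptions.

```python
import numpy as np

def sift_similarity(desc_a, desc_b, ratio=0.75):
    """Fraction of descriptors in desc_a with a distinctive nearest
    neighbour in desc_b (Lowe's ratio test). Descriptors are rows."""
    if len(desc_a) == 0 or len(desc_b) < 2:
        return 0.0
    # pairwise Euclidean distances between descriptor sets
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    d.sort(axis=1)
    good = np.sum(d[:, 0] < ratio * d[:, 1])
    return good / len(desc_a)

def estimate_threshold(sample_similarities, k=2.0):
    """Threshold = mean - k * std of similarities between sample frame
    pairs known to show the same slide (the factor k is an assumption)."""
    s = np.asarray(sample_similarities, dtype=float)
    return s.mean() - k * s.std()

def is_transition(similarity, threshold):
    """A transition is declared when similarity drops below the threshold."""
    return similarity < threshold
```

In practice the descriptors would come from a SIFT detector run on the cropped slide region; any detector producing row-wise descriptor arrays plugs into `sift_similarity` unchanged.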
... H. J. Jeong et al. [9] designed a method to detect forward and backward slide changes in lecture videos. First, the slide regions in the frames are detected, and SIFT features are extracted from them. ...
Article
Informational videos are becoming increasingly important among all video types. Users spend considerable time browsing informative videos even when they are interested in only some of their topics. Hence, this paper presents a new method for extracting descriptive frames that allows users to navigate directly to the topics of interest in a video. The proposed method consists of three main phases: video preprocessing, video segmentation, and video separation. First, frames are extracted from the videos, resized, and converted to grayscale. The frames are then divided into blocks, and the kurtosis moment is calculated for each block. The videos are segmented by examining the differences between the kurtosis features. Finally, informative frames are distinguished from uninformative ones using a clustering technique and grouped into a separate video. The results demonstrated the effectiveness of the proposed method: it performs at up to 100% on the accuracy and F1-score measures, and the summarization reduces the video duration to less than 1% of the original.
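The per-block kurtosis feature this abstract describes can be sketched in numpy. The block size and the use of excess kurtosis are assumptions; the paper's exact segmentation rule on the feature differences is not reproduced here.

```python
import numpy as np

def block_kurtosis(gray, block=8):
    """Excess kurtosis of pixel intensities per non-overlapping block.
    `gray` is a 2-D array; trailing pixels that do not fill a block
    are ignored."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block
    g = gray[:h, :w].reshape(h // block, block, w // block, block)
    g = g.transpose(0, 2, 1, 3).reshape(h // block, w // block, -1).astype(float)
    mu = g.mean(axis=2, keepdims=True)
    m2 = ((g - mu) ** 2).mean(axis=2)   # second central moment
    m4 = ((g - mu) ** 4).mean(axis=2)   # fourth central moment
    return m4 / np.maximum(m2, 1e-12) ** 2 - 3.0

def frame_distance(f1, f2, block=8):
    """Mean absolute difference of the block-kurtosis features;
    a large value suggests a segment boundary."""
    return np.abs(block_kurtosis(f1, block) - block_kurtosis(f2, block)).mean()
```

Thresholding `frame_distance` over consecutive grayscale frames yields candidate segment boundaries in the spirit of the method above.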
... This approach detects slide transitions when the SIFT similarity is under a defined threshold. Features extracted using the SIFT algorithm have shown good slide detection accuracy rates in [10], [22] and with slide alignment [28]. SIFT features can also be used with sparse time-varying graphs [17], where the graph models slide transitions. ...
Conference Paper
Full-text available
With the increasing amount of online learning material on the web, searching for specific content in lecture videos can be time-consuming. Automatic slide extraction from lecture videos can therefore help give a brief overview of the main content and support students in their studies. For this task, we propose a deep learning method to detect slide transitions in lecture videos. We first process each frame of the video with a heuristic-based approach that uses a 2-D convolutional neural network to predict transition candidates. We then increase the model complexity by employing two 3-D convolutional neural networks to refine the transition candidates. Evaluation results demonstrate the effectiveness of our method in finding slide transitions.
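The two-stage cascade this abstract describes, a cheap per-frame scorer proposing candidates and a costlier clip scorer confirming them, can be outlined generically. The function names, thresholds, and window size below are assumptions standing in for the paper's 2-D and 3-D CNNs.

```python
def detect_transitions(frames, score_2d, score_3d,
                       cand_thresh=0.5, final_thresh=0.5, window=4):
    """Two-stage cascade: a cheap per-frame scorer (stand-in for the
    2-D CNN) proposes candidate transitions, and a costlier clip
    scorer (stand-in for the 3-D CNNs), run only on candidates,
    confirms them. Thresholds and window size are assumptions."""
    candidates = [i for i, f in enumerate(frames) if score_2d(f) > cand_thresh]
    confirmed = []
    for i in candidates:
        # temporal context around the candidate frame
        clip = frames[max(0, i - window): i + window + 1]
        if score_3d(clip) > final_thresh:
            confirmed.append(i)
    return confirmed
```

The point of the cascade is cost: the expensive 3-D scorer runs only on the small candidate set, not on every frame.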
... However, the segmentation of educational videos poses a specific challenge, since a topic transition does not necessarily evoke changes in the shown content (e.g., in lectures), and vice versa. Attempts to solve this problem make use of speech transcripts and superimposed text (Tuna et al., 2015) or detection of slide changes (Jeong et al., 2015) to achieve better performance. ...
Article
Full-text available
Using a Web search engine is one of today’s most frequent activities. Exploratory search activities carried out in order to gain knowledge are conceptualized and denoted as Search as Learning (SAL). In this paper, we introduce a novel framework model that incorporates the perspectives of both psychology and computer science to describe the search-as-learning process by reviewing recent literature. The main entities of the model are the learner, who is surrounded by a specific learning context; the interface, which mediates between the learner and the information environment; and the information retrieval (IR) backend, which manages the processes between the interface and the set of Web resources, that is, the collective Web knowledge represented in resources of different modalities. We first provide an overview of the current state of the art with regard to the main entities of our model, before outlining areas of future research to improve our understanding of search-as-learning processes.
... Many existing papers, such as [2,3], adopted this approach to tackle the problem of matching slides to video. These approaches usually compare every frame with every slide using SIFT [1], as shown in figure 1, and the basic idea can be summarized in the following steps: ...
Chapter
In February 2020, 500 h of video were uploaded to YouTube every minute, including numerous videos from the education sector [19, 20]. This mass of videos makes it much more difficult to select suitable teaching or learning videos. In the course of an implementation project with Bachelor students, German-language teaching videos on the topic of mathematics on YouTube were examined, along with how teachers and learners can find them. From the results of this research, a database of selected videos with the corresponding search terms and difficulty levels was created in an exploratory manner. In addition, a chatbot was implemented to support users in their search for learning and teaching videos. The initial analysis showed that only a dozen channels provide the majority of learning content for the subject of mathematics. The technical solution shown is able to extend this result and suggest existing but less well-known educational videos. The present study thus provides both a theoretical and a practical complement to user-centred teaching-learning scenarios.
Article
Presentation style is an important dimension to consider when delivering lectures or presentations. It affects the quality of content delivery as well as the engagement of the students who consume the lectures, which is a key aspect of a learning environment. In this work, we investigate the relationship between student engagement and the presentation style used by the speaker in an online learning environment. For this, we propose automatic models based on deep learning to predict the presentation style (visual, verbal, or balanced) from lecture videos and the student engagement from the emotional behavior of the students. The presentation style model performed with an accuracy of 86% at the frame level and 76% at the video level. In a binary classification setting, the student engagement model achieved an accuracy of 76% and an F1-score of 0.82 at the segment level, and 95% accuracy and an F1-score of 0.97 at the video level. In a regression setting, it achieved a mean squared error of 0.04 at the segment level and 0.15 at the video level. The study of the relationship between presentation style and student engagement showed no statistically significant difference in mean student engagement across presentation styles. We found that approximately 70% of the students were engaged in the considered online learning environment, irrespective of the presentation style.
Article
In today’s world, e-learning is one of the most popular modes of learning, and video lectures are prominent in keeping learners engaged with a course. The Internet has made it possible to keep a large number of video lectures online, but searching this huge video repository for a required topic or subtopic is becoming very tedious. One way to search for a particular topic is keyword-based search, which relies on extracting the text content of lecture video files; to achieve this, metadata has to be maintained. Maintaining the metadata associated with a video requires processing the frames that contain text. Since a video contains many frames per second, it is not necessary to consider every frame; instead, the frames containing distinct content, called key frames, need to be identified. Key frame identification therefore plays a crucial role in the lecture video search process. In this paper, different techniques for key frame identification are experimentally tested and the results compared.
Chapter
In this paper, we present an approach for detecting slide transitions in lecture videos by introducing sparse time-varying graphs. Given a lecture video that records the digital slides, the speaker, and the audience with multiple cameras, our goal is to find the keyframes where the slide content changes. Specifically, we first partition the lecture video into short segments through feature detection and matching. By constructing a sparse graph at each moment with the short video segments as nodes, we formulate the detection task as a graph inference problem. A set of sparse, time-varying adjacency matrices over the edges is then solved for through a global optimization algorithm, and the changes between adjacency matrices reflect the slide transitions. Experimental results show that the proposed system achieves better accuracy than other video summarization and slide progression detection approaches.
Article
Full-text available
Recent advancements in learning and teaching methodology have experimented with virtual reality (VR)-based presentation forms to create immersive learning and training environments. The quality of such educational VR applications relies not only on the virtual model but also on 2D presentation materials such as text, diagrams, and figures. However, manually designing or sourcing these educational resources is both labor-intensive and time-consuming. In this paper, we introduce a new automatic algorithm to detect and extract presentation slides in educational videos, providing abundant resources for creating slide-based immersive presentation environments. The proposed approach involves five core components: shot boundary detection, training instance collection, shot classification, slide region detection, and slide transition detection. We conducted a comparison experiment to evaluate the performance of the proposed method. The results indicate that, in comparison with a peer method, the proposed method improves the precision of slide detection from 81.6% to 92.6% and recall from 74.7% to 86.3% on average. With the detected slides, a content analyzer can be employed to further extract reusable elements for developing VR-based educational applications.
Article
Full-text available
Commonly used evaluation measures, including Recall, Precision, F-Measure and Rand Accuracy, are biased and should not be used without a clear understanding of the biases and a corresponding identification of the chance or base-case levels of each statistic. Under any of these measures, a system that performs worse in the objective sense of Informedness can appear to perform better. We discuss several concepts and measures that reflect the probability that a prediction is informed versus chance, captured by Informedness, and introduce Markedness as a dual measure of the probability that a prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
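For the dichotomous case, both measures follow directly from a 2x2 confusion matrix; a minimal sketch using Powers' definitions (Informedness = Recall + InverseRecall - 1, Markedness = Precision + InversePrecision - 1):

```python
def informedness_markedness(tp, fp, fn, tn):
    """Powers' dichotomous measures from a 2x2 confusion matrix.
    Informedness = Recall + InverseRecall - 1 (probability the
    prediction is informed rather than chance);
    Markedness = Precision + InversePrecision - 1 (probability the
    prediction is marked rather than chance)."""
    recall = tp / (tp + fn)
    inverse_recall = tn / (tn + fp)       # specificity
    precision = tp / (tp + fp)
    inverse_precision = tn / (tn + fn)    # negative predictive value
    informedness = recall + inverse_recall - 1
    markedness = precision + inverse_precision - 1
    return informedness, markedness
```

A chance-level predictor (all four cells equal) scores 0 on both measures, which is exactly the base-case correction the abstract argues Recall and Precision lack.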
Conference Paper
Full-text available
We present a general approach for automatically matching electronic slides to videos of corresponding presentations, for use in distance learning and video proceedings of conferences. We deal with a large variety of videos, various frame compositions and color balances, arbitrary slide sequences, and dynamic camera switching, panning, tilting and zooming. To achieve high accuracy, we develop a two-phase process with unsupervised scene background modelling. In the first phase, scale-invariant feature transform (SIFT) keypoints are applied to frame-to-slide matching under a constrained projective transformation (constrained homography) using random sample consensus (RANSAC). Successful first-phase matches are then used to automatically build a scene background model. In the second phase, the background model is applied to the remaining unmatched frames to boost matching performance for difficult cases, such as wide-field-of-view camera shots where the slide occupies only a small portion of the frame. We also show that color correction is helpful when color-related similarity measures are used for identifying slides. We provide detailed quantitative experimental results characterizing the effect of each part of our approach. The results show that our approach is robust and achieves high performance in matching slides to a number of videos with different styles.
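The geometric core of the first phase, fitting a homography to putative keypoint correspondences and rejecting outliers with RANSAC, can be sketched without any vision library. The DLT fitting, iteration count, and inlier tolerance below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform (DLT) on >= 4 point correspondences:
    the homography is the null vector of the stacked constraint rows."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    return vt[-1].reshape(3, 3)

def project(H, pts):
    """Apply a homography to Nx2 points (homogeneous divide)."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=200, tol=2.0, seed=0):
    """Repeatedly fit on random 4-point samples and keep the
    hypothesis with the most inliers (reprojection error < tol)."""
    rng = np.random.default_rng(seed)
    best_H, best_inliers = None, np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```

In the paper's setting, `src`/`dst` would be SIFT keypoint locations matched between a slide image and a video frame; a production system would use a library implementation with normalization and adaptive iteration counts.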
Conference Paper
In this paper we propose a solution that segments a lecture video by analyzing its supplementary synchronized slides. The slide content is derived automatically through an OCR (optical character recognition) process with an approximate accuracy of 90%. We then partition the slides into different subtopics by examining their logical relevance. Since the slides are synchronized with the video stream, the subtopics of the slides indicate exactly the segments of the video. Our evaluation reveals that the average segment length for each lecture ranges from 5 to 15 minutes, and 45% of the segments obtained from the test datasets are logically reasonable.
Article
Video structuring and indexing are two crucial processes for multimedia document understanding and information retrieval. This paper presents a novel approach to automatically structuring and indexing lecture videos for an educational video system. By structuring and indexing video content, we can support both topic indexing and semantic querying of multimedia documents. Our goal is to extract topic indices and link them with their associated video and audio segments. The two main techniques used in our approach are video image analysis and video text analysis. Using this approach, we obtain an accuracy of over 90% on our test collection.
Article
Video indexing is a central component necessary to facilitate efficient content-based retrieval and browsing of visual information stored in large multimedia databases. This thesis presents work towards a unified framework for automated video indexing. To create an efficient index, a set of representative key frames is selected which capture and encapsulate the entire video content. This is achieved by, firstly, segmenting the video into its constituent shots and, secondly, selecting an...
Conference Paper
This paper proposes a new approach for shot boundary detection using information saliency. Both temporal and spatial saliency are considered to generate an information saliency map (ISM). Shot detection (of both abrupt changes and gradual transitions) is then based on the change of saliency. Six publicly available video databases are used for evaluation, and the results are encouraging. The overall performance of the proposed method outperforms two commercial software packages, VideoAnnex and VCM.
Conference Paper
Despite recent advances in authoring systems and tools, creating multimedia presentations remains a labor-intensive process. This paper describes a system for automatically constructing structured multimedia documents from live presentations. The automatically produced documents contain synchronized and edited audio, video, images, and text. Two essential problems, synchronization of captured data and automatic editing, are identified and solved.
Conference Paper
In this paper, we present approaches for detecting camera cuts, wipes and dissolves based on the analysis of spatio-temporal slices obtained from videos. These slices are composed of spatially and temporally coherent regions which can be perceived as shots. In the proposed methods, camera breaks are located by performing color-texture segmentation and statistical analysis on these video slices. In addition to detecting camera breaks, our methods can classify the detected breaks as camera cuts, wipes and dissolves in an efficient manner