Keyframe Extraction in Endoscopic Video

In recent years, surgeons have begun to record and archive videos of endoscopic surgeries. These videos serve several purposes after the procedure: training young surgeons, documentation, showing patients what has been done, comparative analysis of procedures, and so on. In most of these use cases, the entire video, which can be several hours long, is not relevant; what is needed is a summary of the video or certain interesting segments from it. Since manual processing of these videos is tedious, automatic techniques for abstracting and indexing these long-duration videos are needed. One important task in this context is the extraction of a set of representative frames, so-called keyframes, from the long-duration video.

Apart from endoscopic videos, keyframe extraction is important for other types of videos as well, such as movies, documentaries, sports events, or surveillance footage. Thus, many solutions have been proposed in the literature [1]. For videos that are composed of different shots, where a shot is defined as a sequence of frames recorded without interruption by the same camera, keyframe detection works quite well. For instance, one method is to compare the color histograms of two adjacent frames to detect a significant change in the image sequence. Such a considerable change represents a shot boundary, and a video summary can be created by extracting frames from each detected shot.
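The histogram comparison described above can be sketched in a few lines. This is only a minimal illustration, not a method taken from [1]: the frame representation (a flat sequence of RGB tuples), the quantization into 4 bins per channel, the L1-based distance, and the threshold of 0.5 are all assumptions chosen for readability.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Frame = Sequence[Pixel]        # assumption: a frame flattened to a sequence of pixels

BINS_PER_CHANNEL = 4           # assumed quantization: 4 x 4 x 4 = 64 bins


def color_histogram(frame: Frame) -> List[float]:
    """Return a normalized, coarsely quantized RGB histogram of one frame."""
    step = 256 // BINS_PER_CHANNEL
    hist = [0.0] * (BINS_PER_CHANNEL ** 3)
    for r, g, b in frame:
        index = ((r // step) * BINS_PER_CHANNEL + (g // step)) * BINS_PER_CHANNEL + (b // step)
        hist[index] += 1.0
    total = float(len(frame)) or 1.0   # avoid division by zero for empty frames
    return [count / total for count in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical, 1.0 for disjoint histograms."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return every index i at which frame i differs strongly from frame i - 1."""
    boundaries = []
    previous = None
    for i, frame in enumerate(frames):
        current = color_histogram(frame)
        if previous is not None and histogram_distance(previous, current) > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Toy example: two reddish frames followed by two bluish frames.
red_frame = [(250, 10, 10)] * 16
blue_frame = [(10, 10, 250)] * 16
boundaries = detect_shot_boundaries([red_frame, red_frame, blue_frame, blue_frame])
# boundaries == [2]
```

A fixed threshold like this relies on clear cuts between shots, which is the situation the paragraph above describes for edited video.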

Our goal for keyframe extraction in endoscopic videos is to provide a coarse-grained segmentation in order to select only a few representative frames for each video, with low redundancy (i.e., no similar keyframes) and good quality (i.e., no blurred frames). The extracted frames allow a user to quickly decide whether the underlying video is interesting when browsing the archive and to navigate to a specific video segment in the underlying recording of the surgery (e.g., to show this segment to a patient). To achieve these goals, we developed a keyframe extraction method that is adapted to the specific characteristics of endoscopic videos [2].

Our keyframe extraction method works as follows. To find candidate keyframes, we compare local features of frames, extracted with the ORB (Oriented FAST and Rotated BRIEF) keypoint descriptor [3]. ORB is well suited for our purpose, since it is very robust to image noise and can be computed in real time. The ORB feature points are primarily located on edges of anatomical structures and describe the neighborhood of pixels in a binary and highly efficient form. Using these ORB descriptors, we match temporally close frames in order to find frames with significant content change, which are considered candidate keyframes. These candidate keyframes are inspected further, and frames that are similar to already selected keyframes or that are blurry are filtered out. Blur detection is performed by computing the Difference-of-Gaussians for a candidate frame and discarding frames that have a low difference. Additionally, we apply a circle detection method to exclude the potential black border and to detect and ignore out-of-patient scenes. Thus, our frame extraction method keeps the number of representative frames as low as possible.
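To make the blur check more concrete, the following sketch computes a Difference-of-Gaussians score for a small grayscale image. It is a hedged illustration rather than the implementation from [2]: the two sigma values, the aggregation into a mean absolute difference, the threshold `min_score`, and all function names are assumptions. The blur is written in plain Python to stay self-contained; in practice a library routine such as OpenCV's `cv2.GaussianBlur` would be the more likely choice.

```python
import math
from typing import List

Image = List[List[float]]  # assumption: grayscale image as rows of intensities


def gaussian_kernel(sigma: float) -> List[float]:
    """Return a normalized 1-D Gaussian kernel covering about +/- 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image: Image, kernel: List[float]) -> Image:
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(k * row[min(max(x + j - radius, 0), width - 1)] for j, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image: Image) -> Image:
    """Swap rows and columns."""
    return [list(column) for column in zip(*image)]


def gaussian_blur(image: Image, sigma: float) -> Image:
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))


def dog_score(image: Image, sigma_small: float = 1.0, sigma_large: float = 2.0) -> float:
    """Mean absolute Difference-of-Gaussians response; low values suggest blur."""
    fine = gaussian_blur(image, sigma_small)
    coarse = gaussian_blur(image, sigma_large)
    pixel_count = len(image) * len(image[0])
    total = sum(abs(f - c) for fine_row, coarse_row in zip(fine, coarse)
                for f, c in zip(fine_row, coarse_row))
    return total / pixel_count


def is_blurry(image: Image, min_score: float = 2.0) -> bool:
    """Flag a frame as blurry if its DoG score is below an assumed threshold."""
    return dog_score(image) < min_score
```

A frame with a sharp edge yields a clearly higher score than a uniform frame, so a candidate whose score falls below the threshold would be discarded. A suitable threshold depends on resolution and content and would have to be tuned on real recordings.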


The difference between two frames is determined by calculating the average distance between the best-matching ORB keypoints of these two frames. The lower the distance between two frames, the more similar they are. We use an adaptive threshold to decide whether two frames are similar. In particular, a keyframe is selected only if the current frame distance differs significantly from the average frame distance observed within a certain time window. This ensures that only frames whose keypoints are completely different from those of previous frames are selected as candidate frames. Candidate keyframes are also compared to already selected keyframes, so that similar keyframes from a later portion of the video are ignored. The following pictures visualize the feature point matching for frames with similar and with different content.
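The distance measure and the adaptive threshold can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure from [2]: descriptors are modeled as Python integers holding the 256 bits of an ORB descriptor, and the window length, the factor `k`, and the minimum margin `min_margin` are hypothetical parameters. With OpenCV, the descriptors would typically come from `cv2.ORB_create()` and be matched with a `cv2.BFMatcher` using `cv2.NORM_HAMMING`.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, List, Sequence

DESCRIPTOR_BITS = 256  # an ORB descriptor has 32 bytes, i.e. 256 bits


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors stored as integers."""
    return bin(a ^ b).count("1")


def frame_distance(desc_a: Sequence[int], desc_b: Sequence[int]) -> float:
    """Average distance from each descriptor in A to its best match in B."""
    if not desc_a or not desc_b:
        # Assumption: a frame without keypoints counts as maximally different.
        return float(DESCRIPTOR_BITS)
    return float(mean(min(hamming(a, b) for b in desc_b) for a in desc_a))


def select_candidates(distances: Sequence[float], window: int = 5,
                      k: float = 2.0, min_margin: float = 5.0) -> List[int]:
    """Return indices whose distance clearly exceeds the recent average.

    A distance counts as significant if it lies more than
    max(k * standard deviation, min_margin) above the mean of the
    previous `window` distances.
    """
    recent: Deque[float] = deque(maxlen=window)
    candidates = []
    for i, distance in enumerate(distances):
        if len(recent) == window:
            threshold = mean(recent) + max(k * pstdev(recent), min_margin)
            if distance > threshold:
                candidates.append(i)
        recent.append(distance)
    return candidates


# Toy example: a stable scene with one abrupt content change at index 5.
distances = [10, 11, 10, 12, 11, 40, 11, 10]
candidates = select_candidates(distances)
# candidates == [5]
```

The minimum margin is there so that a window of nearly identical distances does not make every small fluctuation look significant. A further pass, not shown here, would compare each candidate against the keyframes already selected, using the same `frame_distance`, and drop near-duplicates.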

We performed a comparative evaluation through a user study with a surgeon and a group of multimedia experts with experience in medical video analysis. The results show that our approach outperforms other keyframe extraction methods such as uniform or random sampling, as well as a color-based (HSV) method [4] that targets medical videos. Nevertheless, the ratings of the medical expert show that improvements are still required to provide a better overview of an endoscopic video. In particular, keyframes should show semantically important scenes of the surgery (e.g., the anatomy, the pathology, and how it was treated). Therefore, in our future work we will focus on integrating knowledge from domain experts to further improve the performance of our keyframe extraction approach. Furthermore, we are looking into adapting such a method over time by considering relevance feedback.


  1. Truong, B.T., Venkatesh, S.: “Video abstraction: A systematic review and classification”. ACM Trans. Multimedia Comput. Commun. Appl., vol. 3(1), pp. 1–37, 2007.
  2. Schoeffmann, K., Del Fabro, M., Szkaliczki, T., Böszörmenyi, L., Keckstein, J.: “Keyframe extraction in endoscopic video”. Multimedia Tools and Applications, vol. 74(24), pp. 11187–11206, 2015.
  3. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: “ORB: An efficient alternative to SIFT or SURF”. Proc. of IEEE Int. Conf. Computer Vision (ICCV), pp. 2564–2571, 2011.
  4. Mendi, E., Bayrak, C., Cecen, S., Ermisoglu, E.: “Content-based management service for medical videos”. Telemedicine and e-Health, vol. 19(1), 2013.