Example of a crowded 4K video frame annotated with our method.

Source publication
Article
Full-text available
Machine learning has celebrated a lot of achievements on computer vision tasks such as object detection, but the traditionally used models work with relatively low-resolution images. The resolution of recording devices is gradually increasing and there is a rising need for new methods of processing high-resolution data. We propose an attention pipeline...

Contexts in source publication

Context 1
... there are advantages in how much information we can extract from higher resolution images. For example, in Figure 1 we can detect more human figures in the original resolution as compared to resizing the image to the lower resolution required by the models. Given the limitations of current models, we came up with two baseline approaches. ...
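The "all crops" baseline mentioned in this excerpt can be sketched as covering the frame with overlapping, model-sized windows so that every object is evaluated at native resolution in at least one crop. A minimal Python sketch, assuming a 608 px network input and 20% overlap (illustrative values, not necessarily the paper's exact settings):

def crop_grid(frame_w, frame_h, crop=608, overlap=0.2):
    """Yield (x0, y0, x1, y1) crop windows covering the frame with overlap."""
    step = int(crop * (1 - overlap))
    xs = list(range(0, max(frame_w - crop, 0) + 1, step))
    ys = list(range(0, max(frame_h - crop, 0) + 1, step))
    if xs[-1] < frame_w - crop:  # make sure the right edge is covered
        xs.append(frame_w - crop)
    if ys[-1] < frame_h - crop:  # ... and the bottom edge
        ys.append(frame_h - crop)
    for y in ys:
        for x in xs:
            yield (x, y, x + crop, y + crop)

# a 3840 x 2160 frame needs 8 x 5 = 40 overlapping 608 px crops
print(len(list(crop_grid(3840, 2160))))

At these settings every 4K frame already costs 40 model evaluations, which is what motivates a cheaper attention stage that prunes most of them.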
Context 2
... [In] Figure 10 we compare the FPS performance of our attention pipeline model with the all-crops baseline approach. We note that on an average video from the PEViD dataset our method achieves an average performance of 5-6 fps. ...
Context 3
... [When] inspecting the detailed decomposition of operations performed in each frame in Figure 11, we can see that the final evaluation is often not the most time-consuming step. We need to consider client-side operations and the transfer time between one client and the many servers used. ...
Context 4
... [In] the case of 8K videos, the I/O time of opening and saving an image becomes a concern as well, even though it is performed on another thread. Finally, we have explored the influence of the number of servers used for the attention precomputation stage and for the final evaluation stage in Figure 12. We can see that there is a point of saturation when scaling the number of workers. ...
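The saturation described in Context 4 is what a simple scaling model predicts: once the parallelizable evaluation work is spread across enough servers, the serial client-side share (I/O, cropping, transfer handling) dominates the per-frame time. An illustrative Amdahl-style sketch in Python; the timings are made-up placeholders, not measurements from the paper:

def frame_time(n_servers, serial_s=0.08, parallel_s=0.60):
    """Serial client-side work plus evaluation work split across n servers."""
    return serial_s + parallel_s / n_servers

for n in (1, 2, 4, 8, 16, 32):
    t = frame_time(n)
    print(f"{n:2d} servers: {t * 1000:6.1f} ms/frame, {1 / t:4.1f} fps")

With these placeholder numbers throughput climbs quickly up to roughly eight servers and then flattens toward the 1/serial ceiling, mirroring the shape of the saturation seen in Figure 12.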

Similar publications

Article
Full-text available
Visual object detection is a computer vision-based artificial intelligence (AI) technique which has many practical applications (e.g., fire hazard monitoring). However, due to privacy concerns and the high cost of transmitting video data, it is highly challenging to build object detection models on centrally stored large training datasets following...
Preprint
Full-text available
In industrial deep learning applications, manually labeled data contains a certain amount of noisy samples. To solve this problem and achieve a score of more than 90 on the dev dataset, we present a simple method to find the noisy data and re-label it by hand, given the model predictions as references in human labeling. In this paper, we illustrate ou...
Conference Paper
Full-text available
As the use of Deep Neural Networks (DNNs) becomes pervasive, their vulnerability to adversarial attacks and limitations in handling unseen classes pose significant challenges. The state-of-the-art offers discrete solutions aimed at tackling individual issues covering specific adversarial attack scenarios, classification or evolving learning. However...
Preprint
Full-text available
Multitask learning is a common approach in machine learning, which allows training multiple objectives with a shared architecture. It has been shown that by training multiple tasks together, inference time and compute resources can be saved while the objectives' performance remains at a similar or even higher level. However, in perception-related mu...
Preprint
Full-text available
Machine learning is being widely adopted in industrial applications owing to the capabilities of commercially available hardware and rapidly advancing research. Volkswagen Financial Services (VWFS), as a market leader in vehicle leasing services, aims to leverage existing proprietary data and the latest research to enhance existing and derive new b...

Citations

... E.g., additional quality factors such as deformation could be added. Also, the detection speed could be increased even further through additional pre-processing improvements such as image scaling (Růžička and Franchetti 2018), the modification of models through, e.g., layer reduction (van Rijthoven et al. 2018), or the usage of even more lightweight models (Adarsh et al. 2020; Womg et al. 2018). On top of that, additional models, both traditional, such as background-reduction focused (Haque et al. 2008), as well as deep learning-based ones, e.g. ...
Article
Full-text available
Reducing waste through automated quality control (AQC) has both positive economic and ecological effects. In order to incorporate AQC in packaging, multiple quality factor types (visual, informational, etc.) of a packaged artifact need to be evaluated. Thus, this work proposes an end-to-end quality control framework evaluating multiple quality control factors of packaged artifacts (visual, informational, etc.) to enable future industrial and scientific use cases. The framework includes an AQC architecture blueprint as well as a computer vision-based model training pipeline. The framework is designed generically, and then implemented based on a real use case from the packaging industry. As an innovative approach to quality control solution development, the data-centric artificial-intelligence (DCAI) paradigm is incorporated in the framework. The implemented use case solution is finally tested on actual data. As a result, it is shown that the framework's implementation through a real industry use case works seamlessly and achieves superior results. The majority of packaged artifacts are correctly classified with rapid prediction speed. Deep-learning-based and traditional computer vision approaches are both integrated and benchmarked against each other. Through the measurement of a variety of performance metrics, valuable insights and key learnings for future adoptions of the framework are derived.
... To overcome this problem, many research works have extensively explored tiling the input images (Ozge Unel et al. 2019; Yang et al. 2022; Růžička and Franchetti 2018; Plastiras et al. 2018). In the same vein, we propose to split the image into tiles of fixed sizes. ...
Article
Full-text available
The study of macroinvertebrates using computer vision is in its infancy and still faces multiple challenges, including destructive sampling, low signal-to-noise ratios, and the complexity of choosing a model algorithm among the multiple existing ones. In order to deal with those challenges, we propose here a new framework, dubbed 'MacroNet', for the monitoring, i.e., detection and identification at the morphospecies level, of live aquatic macroinvertebrates. This framework is based on an enhanced RetinaNet model. Pre-processing steps are suggested to enhance the characterization properties of the original algorithm. The images are split into fixed-size tiles to better detect and identify small macroinvertebrates. The tiles are then fed as input to the model, and the resulting bounding boxes are assembled. We have optimized the anchor box generation process for high detection performance using the k-medoids algorithm. In order to enhance the localization accuracy of the original RetinaNet model, the complete intersection over union (CIoU) loss has been integrated as a regression loss to replace the standard loss (a smooth L1 norm). Experimental results show that MacroNet outperforms the original RetinaNet model on our database and achieves on average 74.93% average precision (AP), depending on the taxon identity. In our database, taxa were identified at various taxonomic levels, from species to order. Overall, the proposed framework offers promising results for the non-lethal and cost-efficient monitoring of live freshwater macroinvertebrates.
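For reference, the complete intersection over union (CIoU) loss mentioned in this abstract extends the IoU term with a normalized center-distance penalty and an aspect-ratio consistency term (Zheng et al. 2020). A minimal Python sketch for axis-aligned boxes in (x0, y0, x1, y1) form:

import math

def ciou_loss(a, b):
    """CIoU loss = 1 - IoU + center_dist^2 / enclosing_diag^2 + alpha * v."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # squared distance between the two box centers
    d2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
          + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    # squared diagonal of the smallest enclosing box
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2)
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((a[2] - a[0]) / (a[3] - a[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + d2 / c2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (2, 2, 12, 12)))  # ~0.56 for these boxes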
... Large objects in the input image can be detected, but small objects are difficult to detect because the characteristic parts for identifying the objects are also shrunk. Dividing the input image into several parts of a limited size can also be done to prevent shrinkage of the characteristic parts [21-26], but this means large objects that straddle the divided images cannot be detected because the characteristic parts are also divided. As another approach, a coarse-to-fine-based inference scheme for object detection has been proposed [27,28]. ...
Article
To detect a wide range of objects with one camera at once, real-time object detection in high-definition video is required in video artificial intelligence (AI) applications for edge/terminal devices, such as beyond-visual-line-of-sight (BVLOS) drone flight. Although various AI inference schemes for object detection (e.g., you-only-look-once (YOLO)) have been proposed, they typically have limitations on the input image size and thus need to shrink the input high-definition image down to that limit. This causes small objects to collapse and become undetectable. This paper presents our proposed technology for solving this problem and its effective implementation, where multiple object detectors cooperate to detect small and large objects in high-definition video such as full HD and 4K.
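A common post-processing step behind such tiling approaches is stitching: per-tile detections are shifted back into frame coordinates and duplicates from overlapping tiles are merged with IoU-based suppression. A hedged Python sketch, where detect_fn is a hypothetical per-tile detector returning (x0, y0, x1, y1, score) tuples in tile-local coordinates:

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1, ...) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def stitch(tiles, detect_fn, iou_thr=0.5):
    """Run detect_fn per tile, shift boxes to frame coordinates, greedy NMS."""
    boxes = []
    for (tx, ty, _, _), tile_img in tiles:  # tile origin + tile image
        for x0, y0, x1, y1, s in detect_fn(tile_img):
            boxes.append((x0 + tx, y0 + ty, x1 + tx, y1 + ty, s))
    boxes.sort(key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thr for k in kept):
            kept.append(b)
    return kept

Note that greedy NMS keeps the highest-scoring fragment rather than reconstructing an object split across a tile border, which is exactly the failure mode motivating the coarse-to-fine schemes cited in the excerpt above.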
... Thus, most deep learning-based studies for computer vision use resizing methods, such as downscaling, as preprocessing to reduce the input size of the models. However, simple downscaling [13] has a problem in that it can cause information loss when detecting objects. Especially when the objects are much smaller than the images that contain them, the information loss is more pronounced, and this causes the detection performance to decline [14]. ...
... According to the findings of studies related to deep learning models, detecting small objects, such as cracks, in UHR images may degrade the efficiency and performance of deep learning models. Růžička et al. [13] proposed an attention pipeline method that uses a two-stage evaluation of each image or video frame under rough and refined resolution to limit the total number of necessary evaluations. They highlighted that the downscaling of UHR images degraded detection performance and adopted a method of dividing images to address this problem. ...
... There are several preprocessing methods for using UHR images as input for DCNN (deep convolutional neural network)-based complex detection models, including resizing the image itself [13,14,24]. In this study, a method of splitting UHR images into patches of appropriate size was used to minimize information loss. ...
Article
Full-text available
This study proposes a defect detection framework to improve the performance of deep learning-based detection models for ultra-high resolution (UHR) images generated by tunnel inspection systems. Most of the scanning technologies used in tunnel inspection systems generate UHR images. Defects in real-world images, on the other hand, are noticeably smaller than the image. These characteristics make simple preprocessing applications, such as downscaling, difficult due to information loss. Additionally, when a deep learning model is trained by the UHR images under the limited computational resource for training, problems may occur, including a reduction in object detection rate, unstable training, etc. To address these problems, we propose a framework that includes preprocessing and postprocessing of UHR images related to image patches rather than focusing on deep learning models. Furthermore, it includes a method for supplementing problems according to the format of the data annotation in the preprocessing process. When the proposed framework was applied to the UHR images of a tunnel, the performance of the deep learning-based defect detection model was improved by approximately 77.19 percentage points (pp). Because the proposed framework is for general UHR images, it can effectively recognize damage to general structures other than tunnels. Thus, it is necessary to verify the applicability of the defect detection framework under various conditions in future works.
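The patch-based preprocessing these excerpts describe amounts to cutting the UHR image into fixed-size patches and remapping each ground-truth box into patch-local coordinates, clipping at patch borders and discarding fragments too small to be useful. A Python sketch with illustrative names and thresholds (the cited framework's actual annotation-handling rules may differ):

def split_with_annotations(img_w, img_h, boxes, patch=1024, min_frac=0.3):
    """Return [(patch_window, patch-local boxes)]; boxes are (x0, y0, x1, y1)."""
    out = []
    for py in range(0, img_h, patch):
        for px in range(0, img_w, patch):
            pw, ph = min(patch, img_w - px), min(patch, img_h - py)
            local = []
            for x0, y0, x1, y1 in boxes:
                # clip the ground-truth box to this patch
                cx0, cy0 = max(x0, px), max(y0, py)
                cx1, cy1 = min(x1, px + pw), min(y1, py + ph)
                if cx1 <= cx0 or cy1 <= cy0:
                    continue  # no overlap with this patch
                # drop slivers: keep the box only if enough of it survives clipping
                if (cx1 - cx0) * (cy1 - cy0) < min_frac * (x1 - x0) * (y1 - y0):
                    continue
                local.append((cx0 - px, cy0 - py, cx1 - px, cy1 - py))
            out.append(((px, py, px + pw, py + ph), local))
    return out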
... 608×608 px. For example, in [4] a two-stage pedestrian detection system was proposed using a YOLOv2 neural network. Two square images of 2160 × 2160 pixels were cropped from the 4K input image and then scaled to the required dimensions of the network input, i.e. 608 × 608 pixels. ...
Preprint
Full-text available
Object detection is an essential component of many vision systems. For example, pedestrian detection is used in advanced driver assistance systems (ADAS) and advanced video surveillance systems (AVSS). Currently, most detectors use deep convolutional neural networks (e.g., the You Only Look Once (YOLO) family), which, however, due to their high computational complexity, are not able to process a very high-resolution video stream in real-time, especially within a limited energy budget. In this paper we present a hardware implementation of the well-known pedestrian detector with HOG (Histogram of Oriented Gradients) feature extraction and SVM (Support Vector Machine) classification. Our system, running on an AMD Xilinx Zynq UltraScale+ MPSoC (Multiprocessor System on Chip) device, allows real-time processing of 4K resolution (UHD, Ultra High Definition; 3840 x 2160 pixels) video at 60 frames per second. The system is capable of detecting pedestrians at a single scale. The results obtained confirm the high suitability of reprogrammable devices for the real-time implementation of embedded vision systems.
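The cropping scheme from [4] described in the citing excerpt is simple arithmetic: a 3840 x 2160 frame yields two overlapping 2160 x 2160 squares (left- and right-aligned, sharing a 480 px band), each resized to the 608 x 608 network input. A sketch with OpenCV, using its stock HOG people detector as a stand-in for the detector (the cited works use YOLOv2 and a hardware HOG+SVM implementation; frame_4k.png is a placeholder path):

import cv2

frame = cv2.imread("frame_4k.png")        # expected shape: 2160 x 3840 x 3
h, w = frame.shape[:2]                    # h = 2160, w = 3840
crops = [frame[:, :h], frame[:, w - h:]]  # two 2160 x 2160 squares, 480 px overlap

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
for i, c in enumerate(crops):
    small = cv2.resize(c, (608, 608))     # 2160 -> 608 is a ~3.6x downscale
    rects, weights = hog.detectMultiScale(small)
    # rects are in 608 x 608 coordinates; to map back to the frame, scale by
    # 2160 / 608 and shift the right crop's boxes by w - h pixels
    print(f"crop {i}: {len(rects)} detections")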
... Large objects in the input image can be detected, but small objects are difficult to detect because the characteristic parts for identifying objects are also shrunk. Dividing the input image into parts of a limited size can also be considered to prevent shrinking of the characteristic parts [21-26], but this means large objects that straddle the divided images cannot be detected because the characteristic parts are also divided. In other words, the conventional approaches are unsuitable for object detection in high-definition images. ...
Article
Video artificial intelligence (AI) applications for edge/terminal devices, such as non-visual drone flight, require object detection in high-definition video in order to detect a wide range of objects with one camera at once. To satisfy this requirement, we propose a new high-definition object detection technology based on an AI inference scheme and its implementation. In this technology, multiple object detectors cooperate to detect small and large objects in high-definition video. The evaluation results show that our technology can achieve 2.1 times higher detection performance on full HD images thanks to the cooperation of three object detectors.
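The cooperation pattern described in this abstract can be approximated with two complementary passes: a downscaled full-frame pass that keeps large objects detectable, and per-tile passes at native resolution for the small ones, with all results merged afterwards. A hedged Python sketch, where detector is a hypothetical callable returning (x0, y0, x1, y1, score) boxes (the actual technology coordinates three detectors):

import cv2

def detect_cooperative(frame, detector, tile=960, net_in=608):
    """Downscaled full-frame pass (large objects) + native-resolution tiles (small)."""
    h, w = frame.shape[:2]
    results = []
    # pass 1: whole frame shrunk to the network input; rescale boxes back up
    small = cv2.resize(frame, (net_in, net_in))
    sx, sy = w / net_in, h / net_in
    for x0, y0, x1, y1, s in detector(small):
        results.append((x0 * sx, y0 * sy, x1 * sx, y1 * sy, s))
    # pass 2: native-resolution tiles; shift boxes back to frame coordinates
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            for x0, y0, x1, y1, s in detector(frame[ty:ty + tile, tx:tx + tile]):
                results.append((x0 + tx, y0 + ty, x1 + tx, y1 + ty, s))
    return results  # duplicates still need merging, e.g. with NMS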
... YOLO is a single-shot detector, meaning it uses a single CNN to generate and classify the bounding boxes for a given image. The specific implementation developed for the competition was based on the work by Růžička and Franchetti [18]. Instead of a single CNN, a region proposal CNN is used together with a traditional YOLO network structure to reduce input dimensionality, as seen in Figs. 11 and 12. ...
Conference Paper
A suite of solutions was developed by the University of Cincinnati Aerial Vehicles (UCAV) team to address the challenges presented by the 2020 AUVSI SUAS Competition. Competition tasks are reflective of current topics in Unmanned Aerial System (UAS) research including autonomous flight, object detection classification and localization (ODLC), obstacle avoidance, coverage path planning (CPP), and aerial payload delivery. A custom designed, autonomous hexacopter Unmanned Aerial Vehicle (UAV) named Xelaya was developed, having a gross takeoff weight (GTOW) of 22kg and an endurance of more than 30 minutes, allowing for the transport of additional vehicle subsystems. A second vehicle, a custom autonomous Unmanned Ground Vehicle (UGV), was manufactured and tested to be integrated into the UAV platform for the delivery objective. A modular approach to software design was used, taking advantage of the features of Robot Operating System (ROS) for managing data flow and handling a distributed workload across multiple systems and vehicles. Both an autonomous and a manual system were implemented for ODLC. The autonomous system implements a custom convolutional neural network (CNN), while the manual system is composed of two web-based graphical user interfaces (GUIs) for operator input. For obstacle avoidance, a geometry-based method is compared to a node-based A* algorithm approach in order to find the more effective way to minimize both travel distance and execution time. Several methods typically used for solving NP-hard problems, including a genetic algorithm, 2-opt heuristic, and nearest neighbor, are investigated for their application to a CPP problem through the competition's search area.
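The region-proposal-plus-YOLO structure the citing excerpt attributes to Růžička and Franchetti [18] can be sketched as a two-stage loop: a cheap attention pass decides which cells of a coarse grid over the high-resolution frame deserve a full evaluation, and only those crops are sent to the YOLO model, reducing input dimensionality per evaluation. In this Python sketch, attention_fn and yolo_fn are hypothetical stand-ins for the two networks and frame is a NumPy image array:

def attention_pipeline(frame, attention_fn, yolo_fn, cell=608):
    """First pass selects active grid cells; second pass runs YOLO only on those."""
    h, w = frame.shape[:2]
    detections = []
    for cy in range(0, h, cell):
        for cx in range(0, w, cell):
            crop = frame[cy:cy + cell, cx:cx + cell]
            if not attention_fn(crop):  # cheap coarse check: anything of interest?
                continue
            for x0, y0, x1, y1, s in yolo_fn(crop):  # expensive fine evaluation
                detections.append((x0 + cx, y0 + cy, x1 + cx, y1 + cy, s))
    return detections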