Conference Paper

U-Net: Convolutional Networks for Biomedical Image Segmentation

Abstract

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
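The contracting and expanding paths described in the abstract can be made concrete by tracing feature-map sizes through the network. The sketch below is illustrative only. It assumes the configuration commonly described for the original U-Net (four resolution levels, two unpadded 3x3 convolutions per level, 2x2 max pooling, 2x up-sampling); none of these specifics are stated in the abstract, and the function name `unet_output_size` is hypothetical.

```python
def unet_output_size(input_size, depth=4, convs_per_level=2, kernel=3):
    """Trace the side length of a square feature map through a U-Net-style network.

    Assumptions (not taken from the abstract): unpadded convolutions,
    2x2 max pooling with stride 2 on the way down, 2x up-sampling on the way up.
    Raises ValueError if a map becomes empty or cannot be pooled evenly.
    """
    shrink = convs_per_level * (kernel - 1)  # pixels lost per level to unpadded convs
    size = input_size

    # Contracting path: convolve, then halve the resolution.
    for _ in range(depth):
        size -= shrink
        if size <= 0 or size % 2 != 0:
            raise ValueError(f"cannot pool a feature map of size {size}")
        size //= 2

    # Bottleneck convolutions at the lowest resolution.
    size -= shrink

    # Expanding path: double the resolution, then convolve.
    for _ in range(depth):
        if size <= 0:
            raise ValueError("feature map vanished before up-sampling")
        size = 2 * size - shrink

    if size <= 0:
        raise ValueError("input is too small for this depth")
    return size


if __name__ == "__main__":
    print(unet_output_size(572))
```

Under these assumptions the output is smaller than the input, so the prediction covers only the central region of each input tile, and input sizes that produce an odd-sized map before a pooling step are rejected.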

... Model Architecture We used a network architecture similar to the U-Net [93] used in Variational Diffusion Models [88]. We used the implementation publicly available in [110]. ...
... Evaluation probe Details. We used a U-Net [93] backbone with 64 output channels followed by a max pooling layer and n 1-layer MLP classifier for each of the n classes to estimate each concept of an image independently. We sample from the same data distribution but with maximal data diversity, i.e., with s i values in App. ...
... Assuming a model does not have the capability to generate images from an OOD concept class, we simply swap the embedding module for processing the conditioning vector h with that of the last checkpoint. One possible interpretation for this method is that the embedding module disentangles the concepts, i.e., generates a representation for each concept, while the U-Net [93] then utilizes such representations. This would imply that the U-Net [93] already learns how to utilize concept representations early during training, while further gradient steps lead to more robust concept representations. ...
Preprint
Modern generative models demonstrate impressive capabilities, likely stemming from an ability to identify and manipulate abstract concepts underlying their training data. However, fundamental questions remain: what determines the concepts a model learns, the order in which it learns them, and its ability to manipulate those concepts? To address these questions, we propose analyzing a model's learning dynamics via a framework we call the concept space, where each axis represents an independent concept underlying the data generating process. By characterizing learning dynamics in this space, we identify how the speed at which a concept is learned, and hence the order of concept learning, is controlled by properties of the data we term concept signal. Further, we observe moments of sudden turns in the direction of a model's learning dynamics in concept space. Surprisingly, these points precisely correspond to the emergence of hidden capabilities, i.e., where latent interventions show the model possesses the capability to manipulate a concept, but these capabilities cannot yet be elicited via naive input prompting. While our results focus on synthetically defined toy datasets, we hypothesize a general claim on emergence of hidden capabilities may hold: generative models possess latent capabilities that emerge suddenly and consistently during training, though a model might not exhibit these capabilities under naive input prompting.
... It helps clinicians to analyse subject-specific liver morphology and accurately estimate liver volume in real-time. 3D reconstruction from segmentation of 3D scans (slice based 2D image stacks) such as Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) scans, although still demanding, is generally straightforward [13,20,11,9,7,2]. However, the well-known disadvantages of MRI and CT modalities (long acquisition time, cost, and the use of ionizing radiation in CT) make 3D reconstruction using US images attractive. ...
... To establish this point, we also used the standard U-Net and different variants of U-Net [13] to evaluate their performance on this segmentation task in comparison with TransUNet (Table 1). In summary, our TransUNet based segmenter accurately segments the noisy, partial US scans of the liver. ...
... Out of 134 subjects, we allocated 99 for training and 35 for testing. US Liver Segmentation Results: We used FCN [10], UNet [13], UNet++ [20] with EfficientNetB7 encoder, and TransUNet [2] for segmenting US scans for US liver segmentation (Table 1). TransUNet achieved the best Accuracy (Acc.), ...
Preprint
3D reconstruction of the liver for volumetry is important for qualitative analysis and disease diagnosis. Liver volumetry using ultrasound (US) scans, although advantageous due to less acquisition time and safety, is challenging due to the inherent noisiness in US scans, blurry boundaries, and partial liver visibility. We address these challenges by using the segmentation masks of a few incomplete sagittal-plane US scans of the liver in conjunction with a statistical shape model (SSM) built using a set of CT scans of the liver. We compute the shape parameters needed to warp this canonical SSM to fit the US scans through a parametric regression network. The resulting 3D liver reconstruction is accurate and leads to automatic liver volume calculation. We evaluate the accuracy of the estimated liver volumes with respect to CT segmentation volumes using RMSE. Our volume computation is statistically much closer to the volume estimated using CT scans than the volume computed using Childs' method by radiologists: p-value of 0.094 (>0.05) says that there is no significant difference between CT segmentation volumes and ours in contrast to Childs' method. We validate our method using investigations (ablation studies) on the US image resolution, the number of CT scans used for SSM, the number of principal components, and the number of input US scans. To the best of our knowledge, this is the first automatic liver volumetry system using a few incomplete US scans given a set of CT scans of livers for SSM.
... It is important to note that this was performed on the basis of image patches within the overall image and not on a per pixel basis. For image segmentation, three models were trained and evaluated including U-NET (Ronneberger et al., 2015), fully convolutional network (FCN) (Long et al., 2015), and DeepLabV3. All models were trained and evaluated on image patch sizes of 256 × 256 pixels. ...
... In this study, models representing each approach were trained and tested to evaluate performance metrics and assess suitability for the task of detecting and quantifying CRS. For classification, ResNet18 (He et al., 2016), MobileNetV3 small, custom small and large (Koonce, 2021b), and EfficientNet-B3 and EfficientNet-B4 (Koonce, 2021a) were implemented; for segmentation, U-NET (Ronneberger et al., 2015), FCN (Long et al., 2015), and DeepLabV3 were used. Each approach has advantages and disadvantages: classification models are generally faster, while segmentation models are often more accurate and have higher performance based on F1 score in particular. ...
... U-NET is a fully convolutional neural network that has an encoder-decoder structure. It was developed by Ronneberger et al. (2015) to address the problem of semantic segmentation in biomedical images and uses skip connections. The encoder transforms the input image into a latent space with lower dimension and the decoder transforms the latent space into the output image, usually with the same dimension as the input. ...
Article
Charcoal rot of sorghum (CRS) is a significant disease affecting sorghum crops, with limited genetic resistance available. The causative agent, Macrophomina phaseolina (Tassi) Goid, is a highly destructive fungal pathogen that targets over 500 plant species globally, including essential staple crops. Utilizing field image data for precise detection and quantification of CRS could greatly assist in the prompt identification and management of affected fields and thereby reduce yield losses. The objective of this work was to implement various machine learning algorithms to evaluate their ability to accurately detect and quantify CRS in red‐green‐blue images of sorghum plants exhibiting symptoms of infection. EfficientNet‐B3 and a fully convolutional network emerged as the top‐performing models for image classification and segmentation tasks, respectively. Among the classification models evaluated, EfficientNet‐B3 demonstrated superior performance, achieving an accuracy of 86.97%, a recall rate of 0.71, and an F1 score of 0.73. Of the segmentation models tested, FCN proved to be the most effective, exhibiting a validation accuracy of 97.76%, a recall rate of 0.68, and an F1 score of 0.66. As the size of the image patches increased, both models’ validation scores increased linearly, and their inference time decreased exponentially. This trend could be attributed to larger patches containing more information, improving model performance, and fewer patches reducing the computational load, thus decreasing inference time. The models, in addition to being immediately useful for breeders and growers of sorghum, advance the domain of automated plant phenotyping and may serve as a foundation for drone‐based or other automated field phenotyping efforts. Additionally, the models presented herein can be accessed through a web‐based application where users can easily analyze their own images.
... Among these, Swin UNETR [2,12] employs the Swin technique to construct a UNet-shaped architecture featuring a Transformer-based encoder and a convolutional-based decoder. In support of the proposed research, it is noteworthy to indicate that the UNet model and its extensions have been examined extensively for biological applications [13,14]. In basic UNet-CNN models [15][16][17], the encoder block maps the image to a lower-dimensional (latent) space, which is then reconstructed by the decoder block. ...
... The early versions of ViTs [3,22] were designed for the image classification tasks. ViT incorporates a self-attention mechanism, by means of pairwise similarities of input tokens, to capture long-range dependencies, which improved performance compared to ResNet [5] or vanilla UNet [13]. Recent studies [23,24] have extended Transformer-based architectures to enhance the performance of medical segmentation tasks. ...
Conference Paper
Vision Transformers have shown superior performance to the traditional convolutional-based frameworks in many vision applications, including but not limited to the segmentation of 3D medical images. To further advance this area, this study introduces the Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net), which integrates the output of Swin Transformers and their corresponding convolutional blocks using 3D fusion blocks. The Multi-Aperture incorporates each image patch at its original resolutions with its pyramid representation to better preserve minute details. The proposed architecture has demonstrated a score of 89.73±0.04 and 7.31±0.02 for Dice and HD95, respectively, on the Synapse multi-organs dataset, an improvement over the published results. The improved performance also comes with the added benefits of the reduced complexity of approximately 40 million parameters. Our code is available at https://github.com/Siyavashshabani/MFTC-Net
... Notably, U-Net [2] has gained prominence as a widely adopted fully CNN structure, characterized by symmetric encoders and decoders. The incorporation of skip connections in U-Net enhances the preservation of details, contributing to notable success in diverse medical image segmentation tasks. ...
... This indicates ResTrans-Unet's competitiveness over TransUNet in both overall segmentation results and organ edge predictions. Compared with eight models (U-Net [2], V-Net [31], DARR [32], Att-UNet [33], MIM [34], UCTransNet [35], TransNorm [36], and SwinUNet [9]), our method still produced competitive results. For enhanced understanding, Figure 4 presents visualizations of the results for TransUNet, SwinUNet, and our method. ...
Article
The convolutional neural network has significantly enhanced the efficacy of medical image segmentation. However, challenges persist in the deep learning‐based method for medical image segmentation, necessitating the resolution of the following issues: (1) Medical images, characterized by a vast spatial scale and complex structure, pose difficulties in accurate edge information extraction; (2) In the decoding process, the assumption of equal importance among different channels contradicts the reality of their varying significance. This study addresses challenges observed in earlier medical image segmentation networks, particularly focusing on the precise extraction of edge information and the inadequate consideration of inter‐channel importance during decoding. To address these challenges, we introduce ResTrans‐Unet (residual transformer medical image segmentation network), an automatic segmentation model based on Residual‐aware transformer. The Transformer is enhanced through the incorporation of ResMLP, resulting in enhanced edge information capture in images and improved network convergence speed. Additionally, Squeeze‐and‐Excitation Networks, which emphasize channel relationships, are integrated into the decoder to precisely highlight important features and suppress irrelevant ones. Experimental validations on two public datasets were carried out to assess the proposed model, comparing its performance with that of advanced models. The experimental results unequivocally demonstrate the superior performance of ResTrans‐Unet in medical image segmentation tasks.
... The hybrids of the conventional models such as AC, LSM, and superpixels and the DL show promising results. They often outperform the pure DL variants [64,65]. Consider an example of the hybrid of LSM and a popular neural network U-Net. ...
... As opposed to that, the DL learns from every segmentation. 155 US images have been coupled with the corresponding elastography from http://onlinemedicalimages.com and In addition to the three models, a basic U-net [65] along with the U-net hybrids: LS-U-net, MS-U-net and SP-U-net are tested. The models have been selected since they show the second best performance in Sect. ...
Article
Segmentation of tumors in the ultrasound (US) images of the breast is a critical problem in medical imaging. Due to the poor quality of the US images and varying specifications of the US machines, the segmentation and classification of the abnormalities present difficulties even for trained radiologists. Nevertheless, the US remains one of the most reliable and inexpensive tests. Recently, an artificial life (ALife) model based on tracing agents and fusion of the US and the elasticity images (F-ALife) has been proposed and analyzed. Under certain conditions, F-ALife outperforms state-of-the-art including the selected deep learning (DL) models, deformable models, machine learning, contour grouping and superpixels. Apart from the improved accuracy, F-ALife requires smaller training sets. The strongest competitors of the F-ALife are hybrids of the DL with conventional models. However, the current DL methods require a large amount of data (thousands of annotated images), which often is not available. Moreover, the hybrids require that the conventional model is properly integrated into the DL. Therefore, we offer a new DL-based hybrid with ALife. It is characterized by a high accuracy, requires a relatively small dataset, and is capable of handling previously unseen data. The new ideas include (1) a special image mask to guide ALife, generated using DL and the distance transform; (2) a modification of ALife for segmentation of the US images that provides high accuracy (these ideas are motivated by the “vehicles” of Braitenberg (Vehicles, experiments in synthetic psychology, MIT Press, Cambridge, 1984) and the ALife proposed in Karunanayake et al. (Pattern Recognit 108838, 2022)); and (3) a two-level genetic algorithm which includes training by an individual image and by the entire set of images. The training employs an original categorization of the images based on the properties of the edge maps. The efficiency of the algorithm is demonstrated on complex tumors.
The method combines the strengths of the DL neural networks with the speed and interpretability of ALife. The tests based on the characteristics of the edge map and complexity of the tumor shape show the advantages of the proposed DL-ALife. The model outperforms 14 state-of-the-art algorithms applied to the US images characterized by a complex geometry. Finally, the novel classification allows us to test and analyze the limitations of the DL for the processing of the unseen data. The code is applicable to breast cancer diagnostics (Automated Breast Ultra Sound), US-guided biopsies as well as to projects related to automatic breast scanners. A video demo is at https://tinyurl.com/3xthedff.
... The MPRNet [27] model incorporated an innovative multi-stage design [30][31][32][33], contrasting with the traditional single-stage architectures prevalent in low-level vision tasks [30][34][35][36]. Our RainDNet model preserves this powerful multi-stage design and introduces further refinements. ...
[Fig. 1: Image deraining on the Rain100H [18], Rain100L [18], Test100 [24], Test1200 [25] and Test2800 [26] datasets. Despite having a lot fewer parameters than the underlying model, our optimised Multi-stage technique outperforms the original cutting-edge MPRNet [27][28][29] in terms of PSNR and SSIM, suggesting greater picture restoration quality from a human visual perception standpoint.]
... Subnetwork with encoder-decoder configuration As depicted in Fig. 4a, our encoder-decoder subnetwork, derived from the standard U-Net [28], includes several adjustments to accommodate our specific requirements. We mostly use channel attention blocks (CABs) [29] for collecting multiscale traits. ...
Article
RainDNet is an advanced image deraining model that refines the “Multi-Stage Progressive Image Restoration Network” (MPRNet) for superior computational efficiency and perceptual fidelity. RainDNet’s innovative architecture employs depthwise separable convolutions instead of MPRNet’s traditional ones, reducing model complexity and improving computational efficiency while preserving the feature extraction ability. RainDNet’s performance is enhanced by a multi-objective loss function combining perceptual loss for visual quality and Structural Similarity Index Measure (SSIM) loss for structural integrity. Experimental evaluations demonstrate RainDNet’s superior performance over MPRNet in terms of Peak Signal-to-Noise Ratio (PSNR), SSIM, and BRISQUE (Blind Referenceless Image Spatial Quality Evaluator) scores across multiple benchmark datasets, underscoring its aptitude for maintaining image fidelity while restoring structural and textural details. Our findings invite further explorations into more efficient architectures for image restoration tasks, contributing significantly to the field of computer vision. Ultimately, RainDNet lays the foundation for future, resource-efficient image restoration models capable of superior performance under diverse real-world scenarios.
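The complexity reduction attributed to depthwise separable convolutions can be illustrated by counting weights. The sketch below compares a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution. The layer widths are hypothetical and are not taken from RainDNet or MPRNet, bias terms are ignored, and the function names are made up for illustration.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out = 64, 128  # hypothetical layer widths, chosen only for illustration
    standard = conv_params(c_in, c_out)
    separable = depthwise_separable_params(c_in, c_out)
    print(f"standard: {standard}, separable: {separable}, "
          f"ratio: {standard / separable:.2f}")
```

The ratio approaches k*k as the number of output channels grows, which is why the saving is largest in wide layers.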
... Uhl et al. (2017) and Uhl et al. (2018) used LeNet (LeCun et al. 1989) to segment individual buildings and urban areas. Heitzler and Hurni (2020) segmented individual buildings with U-Net (Ronneberger, Fischer, and Brox 2015) and automatically vectorized the building footprints afterward. However, all these approaches are susceptible to false positives of symbols with similar appearances to buildings (black spots), like borders, roads, texts, and railways (Heitzler and Hurni 2020). ...
... Proposed by Ronneberger, Fischer, and Brox (2015), U-Net is a light, effective, and widely used Encoder-Decoder structure for image segmentation. It extracts features through the encoder and restores the original resolution through the decoder with skip connections to gather information from lower layers in the ... [Figure 2: The network structure.] ...
Article
Historical maps are almost the exclusive source to trace back the characteristics of the Earth before modern earth observation techniques came into being. Processing historical maps is challenging due to factors such as diverse designs and scales, or inherent noise from painting, aging, and scanning. Our paper is the first to leverage uncertainty estimation under the framework of Bayesian deep learning to model noise inherent in maps for semantic segmentation of hydrological features from scanned topographic historical maps. To distinguish different features with similar symbolization, we integrate atrous spatial pyramid pooling (ASPP) to incorporate multi-scale contextual information. In total, our algorithm yields predictions with an average dice coefficient of 0.827, improving the performance of a simple U-Net by 26%. Our algorithm outputs intuitively interpretable pixel-wise uncertainty maps that capture uncertainty in object boundaries, noise from drawing, aging, and scanning, as well as out-of-distribution designs. We can potentially use the predicted uncertainty to refine segmentation results, locate rare designs, and select reliable features for future GIS analyses.
... In recent years, deep learning-based convolutional neural networks (CNNs) have demonstrated remarkable performance across various domains of computer vision, achieving human-comparable accuracy in tasks such as image classification [22][23][24][25], object detection [26][27][28][29], and image segmentation [30][31][32]. Specifically in rock fragment segmentation, CNN-based methods have shown significant advantages over traditional image segmentation methods [21][33][34][35][36]. Guo et al. [21] proposed a network for rock fragment segmentation based on multiple CNN structures. A total of 14,628 labeled image patches, extracted from the entire blast rock pile images using data augmentation methods, were used for model development. ...
... The proposed method demonstrated its excellent capability to recognize rock edge features and successfully separated coarse fragments from fine contiguous areas. Among CNN architectures, the U-Net proposed by Ronneberger et al. [30] has become the most practical method for rock fragment segmentation due to its remarkable performance. Various extended versions have been proposed and applied in rock fragmentation prediction [40][41][42][43]. ...
Article
Rock fragmentation is an important evaluation indicator for field blasting operations. This paper applies a deep learning-based method, the Segment Anything Model (SAM), to automatically segment rock fragments. To review the SAM’s segmentation performance, 83 images of rock fragments collected from the mine site were used as the test dataset. Pixel-level accuracy (PA), intersection over union (IOU), and dice coefficient (Dice) were employed to evaluate the model pixel-level segmentation performance. The results showed that the SAM exhibited excellent segmentation performance on the test data (PA = 94.5%, IOU = 94.4%, Dice = 95.4%). The coefficient of determination (R²) values for the 50% and 80% passing sizes (X50 and X80) were 0.970 and 0.991, respectively, which demonstrated that the SAM could achieve high precision measurement of rock fragmentation. Additionally, the effectiveness of the SAM was further evaluated by comparing it to commercial software, and the generalizability of the SAM was verified on two other datasets. The findings revealed that the SAM not only outperformed the Split-Desktop V 4.0 on the test dataset but also achieved comparable accuracy to previous studies on the two other datasets. The SAM could be regarded as a useful tool to provide fast and accurate feedback for field blasting.
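The three pixel-level metrics named in this abstract can be sketched for a single pair of binary masks. The code below uses the standard textbook definitions and assumes a foreground/background labelling stored as nested lists of 0 and 1. The abstract does not say how the scores were computed or averaged over images, so this is only an illustration, and the function name `segmentation_scores` is hypothetical.

```python
def segmentation_scores(pred, truth):
    """Return (pixel accuracy, IoU, Dice) for two equally sized binary masks.

    Masks are nested lists of 0/1 values. When both masks are empty,
    IoU and Dice are reported as 1.0 by convention (an assumption).
    """
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1   # foreground predicted and present
            elif p:
                fp += 1   # foreground predicted but absent
            elif t:
                fn += 1   # foreground missed
            else:
                tn += 1   # background correctly predicted

    total = tp + fp + fn + tn
    union = tp + fp + fn
    pixel_accuracy = (tp + tn) / total
    iou = tp / union if union else 1.0
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return pixel_accuracy, iou, dice


if __name__ == "__main__":
    pred = [[1, 1, 0],
            [0, 1, 0]]
    truth = [[1, 0, 0],
             [0, 1, 1]]
    print(segmentation_scores(pred, truth))
```

For a single mask pair the two overlap scores are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.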
... A popular DL-based registration approach is VoxelMorph (Balakrishnan et al., 2019), which is designed for brain MRI applications. It uses a U-Net-like architecture (encoder-decoder with skip-connections) (Ronneberger et al., 2015) and employs the scaling and squaring integration method on computed velocity fields to obtain diffeomorphic deformation fields. Kuang et al.(Kuang and Schmah, 2019) developed the fast image registration (FAIM) algorithm, which showed superior results to VoxelMorph. ...
Article
This study evaluates the performance of conventional SyN ANTs and learning-based registration methods in the context of pediatric neuroimaging, specifically focusing on intra-subject deformable registration. The comparison involves three approaches: without (NR), with rigid (RR), and with rigid and affine (RAR) initializations. In addition to initialization, performances are evaluated in terms of accuracy, speed, and the impact of age intervals and sex per pair. Data consists of the publicly available MRI scans from the Calgary Preschool dataset, which includes 63 children aged 2-7 years, allowing for 431 registration pairs. We implemented the unsupervised deep learning (DL) framework with a U-Net architecture using DeepReg and it was 5-fold cross-validated. The evaluation includes Dice scores for tissue segmentation from 18 smaller regions obtained by SynthSeg, analysis of log Jacobian determinants, and registration pro-rated training and inference times. Learning-based approaches, with or without linear initializations, exhibit slight superiority over SyN ANTs in terms of Dice scores. Specifically, DL-based implementations with RR and RAR initializations significantly outperform SyN ANTs. The lower Dice scores of SyN ANTs are likely due to its lack of population-based optimization, unlike the DL methods which learn optimal parameters through training. Both SyN ANTs and DL-based registration involve parameter optimization, but the choice between these methods depends on the scale of registration: network-based for broader coverage or SyN ANTs for specific structures. Learning-based registration offers fast inference times but needs training, whereas SyN ANTs requires manual fine-tuning, with less clear guidelines, particularly for younger cohorts. Both methods face challenges with larger age intervals due to greater growth changes. 
Future work will extend the framework to younger populations and explore models that better separate different levels of transformations for improved local brain region registration. The main takeaway is that while DL-based methods show promise with faster and more accurate registrations, SyN ANTs remains robust and generalizable without the need for extensive training, highlighting the importance of method selection based on specific registration needs in the pediatric context. Our code is available at https://github.com/neuropoly/pediatric-DL-registration
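The citing excerpt for this work mentions scaling and squaring integration of velocity fields. A one-dimensional toy version of that idea is sketched below: the velocity is divided by 2^steps to obtain a small displacement, which is then composed with itself `steps` times. This is not the VoxelMorph or DeepReg implementation. The linear interpolation, the clamped boundary handling, and the function names are all assumptions made for illustration.

```python
def interp_linear(values, x):
    """Linearly interpolate samples taken at integer positions, clamping x to the grid."""
    last = len(values) - 1
    x = min(max(x, 0.0), float(last))
    i = min(int(x), last - 1)          # left neighbour; keeps i + 1 inside the grid
    frac = x - i
    return (1.0 - frac) * values[i] + frac * values[i + 1]


def scaling_and_squaring(velocity, steps=6):
    """Approximate the displacement of exp(velocity) for a stationary 1D velocity field.

    `velocity` holds one value per grid point and needs at least two points.
    """
    # Scaling: shrink the velocity so that x + u(x) is a small deformation.
    disp = [v / (2 ** steps) for v in velocity]

    # Squaring: compose the deformation with itself, doubling the integration time.
    # (phi o phi)(x) = x + u(x) + u(x + u(x))
    for _ in range(steps):
        disp = [u + interp_linear(disp, x + u) for x, u in enumerate(disp)]
    return disp


if __name__ == "__main__":
    # A constant velocity field should integrate to a uniform shift.
    print(scaling_and_squaring([0.5] * 8))
```

A constant velocity integrates to a uniform shift of the same size, which gives a simple sanity check; for spatially varying fields the result converges to the exact flow as `steps` increases.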
... Only in recent years have studies emerged addressing this aspect. For instance, the paper [6] proposed a method for discerning real and generated images by modifying the discriminator to a U-Net [7] structure. The improved U-Net discriminator is used for pixel-wise discernment between real and generated images. ...
Preprint
In recent years, Generative Adversarial Networks (GANs) have demonstrated enormous promise in areas connected to image generation. As the model generation performance continues to improve and the generated images become more realistic, it is difficult to effectively distinguish between the real image and the generated image. Therefore, the problem of discriminating and optimizing the generated images (adversarial discrimination) has become necessary, and subsequent optimization plans are proposed based on the discrimination strategy. However, due to the nature of convolution, the two-dimensional power spectrum curve of the generated image is low overall; that is, compared with the real image, there is energy loss at each frequency (without other processing), and the curve drops rapidly and approaches zero, which is obviously different from the real image. In particular, the curve of the image generated by transposed convolution has a clear upward trend at the very high-frequency part, which is contrary to the characteristic of the real image, which is that the energy decreases with increasing frequency. Based on the discussion of the characteristics and inducements of the two-dimensional power spectrum curve of the generated image, we present a discrimination approach based on curve warping at high frequency and energy loss to improve the discrimination capacity of the generated image and realize the effective discrimination between the real image and the generated image. Based on this, we present the power spectrum loss function to improve the upward warping characteristics of the very high-frequency part of the two-dimensional power spectrum curve without degrading the quality of the generated image and the high-frequency feature loss function to improve the quality of the generated image. 
The value and efficiency of the proposed discrimination approach in this study are demonstrated on multiple GAN models, including WGAN, WGAN-GP, and SAGAN, with the CelebA dataset, and on a GAN model with an encoder-decoder generator with the CelebA-HQ dataset. The two proposed loss functions are also demonstrated on multiple GAN models, including WGAN, WGAN-GP, and SAGAN, with the FFHQ dataset. After adding the high-frequency feature loss, the FID decreases by 5.97, 5.15, and 6.56, respectively. After adding the power spectrum loss, the above models can improve the upward warping characteristics of the two-dimensional power spectrum curve in the very high-frequency part of the generated image to a certain extent. The FID decreases by 17.4, 11.55 and 12.27 when the weight is fixed, and 12.66, 8.15 and 4.46 when the weight is variable, respectively.
... This module enhances image realism by incorporating descriptive prompts such as 'reflect' and 'volume fog' in addition to 'rain' and 'foggy,' generating images that more closely mimic actual weather conditions. These descriptive prompts are converted into text embeddings and processed through a U-Net [47] encoder. Simultaneously, images processed by the Adaptive Image Cropping module are encoded into latent features, which are then adjusted in a hidden space to effectively simulate diverse weather conditions. ...
Preprint
Road damage detection involves identifying cracks, potholes, and other surface irregularities from collected images. This technology is crucial for road maintenance and ensuring traffic safety. Despite significant progress in object detection algorithms, challenges such as weather-induced variability, dispersed key features, and diverse forms of damage persist. To address these issues, this paper proposes a road damage detection algorithm named Flexi-Weather Hard Detection, which integrates data augmentation based on AIGC and corner point feature aggregation. One of the modules, named Weather Trim Augment, utilizes stable diffusion technology to generate road damage data under various weather conditions. This enhancement expands the training dataset and reduces the negative impact of weather on detection accuracy. The Flexi Corner Block utilizes deformable convolutions and combines a lightweight MLP with a learnable visual center mechanism to leverage corner points, enhancing local feature learning and improving the detection of subtle and dispersed features in a multi-scale context. Additionally, the HXIOU loss function is designed, employing weighted calculations and multiple metrics to effectively mine hard examples with significant variability, thus enhancing the detection accuracy of difficult cases such as blurred potholes and fine cracks. Comprehensive experiments on the RDD2020 and CNRDD datasets demonstrate that the proposed approach significantly improves performance, achieving 64.9% in the Test1 metric and 40.6% in the F1-Score. Notably, the algorithm achieves robust detection in unannotated, adverse weather conditions such as snow and rain, showcasing excellent generalization capabilities.
... Thus we follow the network architecture of Bauer et al. [3] for reconstruction of sparse images, and adopt FoVolNet's two-stage hybrid architecture for our model. This architecture is based on W-Net [53], and utilizes two U-Net [43] networks in sequence. The first network is used for directly filling in the incomplete image and the second for refining output to a high-quality final image. ...
Article
Full-text available
New web technologies have enabled the deployment of powerful GPU-based computational pipelines that run entirely in the web browser, opening a new frontier for accessible scientific visualization applications. However, these new capabilities do not address the memory constraints of lightweight end-user devices encountered when attempting to visualize the massive data sets produced by today's simulations and data acquisition systems. We propose a novel implicit isosurface rendering algorithm for interactive visualization of massive volumes within a small memory footprint. We achieve this by progressively traversing a wavefront of rays through the volume and decompressing blocks of the data on-demand to perform implicit ray-isosurface intersections, displaying intermediate results each pass. We improve the quality of these intermediate results using a pretrained deep neural network that reconstructs the output of early passes, allowing for interactivity with better approximations of the final image. To accelerate rendering and increase GPU utilization, we introduce speculative ray-block intersection into our algorithm, where additional blocks are traversed and intersected speculatively along rays to exploit additional parallelism in the workload. Our algorithm is able to trade off image quality to greatly decrease rendering time for interactive rendering even on lightweight devices. Our entire pipeline is run in parallel on the GPU to leverage the parallel computing power that is available even on lightweight end-user devices. We compare our algorithm to the state of the art in low-overhead isosurface extraction and demonstrate that it achieves $1.7\times$ – $5.7\times$ reductions in memory overhead and up to $8.4\times$ reductions in data decompressed.
... One of the most used architectures, because of its efficiency with fewer training images, is the so-called U-net, which has been widely used for the task of medical image segmentation [37]. It was introduced by Ronneberger et al. [38] and was initially designed for electron microscopy images, but has also been used on images acquired from different modalities [39,40]. The U-net has been proven to provide good results for segmenting eye-related images [41,42], and considering the low computational cost and great performance with limited training data, this architecture is selected for the segmentation stage. ...
Article
Full-text available
Zebrafish (Danio rerio) eyes are widely used in modeling studies of human ophthalmic diseases, including glaucoma and myopia. These pathologies cause morphological variations in the anterior chamber elements, which can be quantitatively measured using morphometric parameters, such as the corneal curvature, central corneal thickness, and anterior chamber angle. In the present work, an automated method is presented for iris and corneal segmentation, as well as the determination of the above-mentioned morphometry from optical coherence tomography (OCT) scans of zebrafish. The proposed method consists of four stages; namely, preprocessing, segmentation, postprocessing, and extraction of morphometric parameters. The first stage is composed of a combination of wavelet and Fourier transforms as well as gamma correction for artifact removal/reduction. The segmentation step is achieved using the U-net convolutional neural network. The postprocessing stage is composed of multilevel thresholding and morphological operations. Finally, three algorithms are proposed for automated morphological extraction in the last step. The morphology obtained using our automated framework is compared against manual measurements to assess the effectiveness of the method. The obtained results show that our scheme allows reliable determination of the morphometric parameters, thereby allowing efficient assessment for massive studies on zebrafish anterior chamber morphology using OCT scans.
... Deep learning is a machine learning technique based on artificial neural networks, in which convolutional neural networks (CNN) have gradually emerged in various computer vision tasks [21,22]. Compared to traditional algorithms, deep learning-based algorithms can achieve better performance and versatility in an end-to-end manner. In the field of image segmentation, many segmentation models based on convolutional networks have been proposed, mainly including fully convolutional networks (FCN) [23], U-Net [24], and generative adversarial networks (GAN) [25]. ...
Article
Full-text available
Background Glaucoma is a worldwide eye disease that can cause irreversible vision loss. Early detection of glaucoma is important to reduce vision loss, and retinal fundus image examination is one of the most commonly used solutions for glaucoma diagnosis due to its low cost. Clinically, the cup-disc ratio of fundus images is an important indicator for glaucoma diagnosis. In recent years, there have been an increasing number of algorithms for segmentation and recognition of the optic disc (OD) and optic cup (OC), but these algorithms generally have poor universality, segmentation performance, and segmentation accuracy. Methods We improved the YOLOv8 algorithm for segmentation of OD and OC. Firstly, a set of algorithms was designed to adapt the REFUGE dataset’s result images to the input format of the YOLOv8 algorithm. Secondly, in order to improve segmentation performance, the network structure of YOLOv8 was improved, including adding an ROI (Region of Interest) module and changing the bounding box regression loss function from CIOU to Focal-EIoU. Finally, by training and testing on the REFUGE dataset, the improved YOLOv8 algorithm was evaluated. Results The experimental results show that the improved YOLOv8 algorithm achieves good segmentation performance on the REFUGE dataset. In the OD and OC segmentation tests, the F1 score is 0.999. Conclusions We improved the YOLOv8 algorithm and applied the improved model to the segmentation task of OD and OC in fundus images. The results show that our improved model is far superior to the mainstream U-Net model in terms of training speed, segmentation performance, and segmentation accuracy.
... The DDLNet, our proposed network, adopts a UNet-like three-stage architecture [42] comprising six Blocks. Each Block integrates a distribution-decouple module (DDM), a Dual-Frequency Attention Mechanism (DFAM), and a ResGroup containing residual blocks. ...
Article
Full-text available
Image dehazing methods face challenges in addressing the high coupling between haze and object feature distributions in the spatial and frequency domains. This coupling often results in oversharpening, color distortion, and blurring of details during the dehazing process. To address these issues, we introduce the distribution-decouple module (DDM) and dual-frequency attention mechanism (DFAM). The DDM works effectively in the spatial domain, decoupling haze and object features through a feature decoupler and then uses a two-stream modulator to further reduce the negative impact of haze on the distribution of object features. Simultaneously, the DFAM focuses on decoupling information in the frequency domain, separating high- and low-frequency information and applying attention to different frequency components for frequency calibration. Finally, we introduce a novel dehazing network, the distribution-decouple learning network for single image dehazing with spatial and frequency decoupling (DDLNet). This network integrates DDM and DFAM, effectively addressing the issue of coupled feature distributions in both spatial and frequency domains, thereby enhancing the clarity and fidelity of the dehazed images. Extensive experiments indicate the outperformance of our DDLNet when compared to the state-of-the-art (SOTA) methods, achieving a 1.50 dB increase in PSNR on the SOTS-indoor dataset. Concomitantly, it indicates a 1.26 dB boost on the SOTS-outdoor dataset. Additionally, our method performs significantly well on the nighttime dehazing dataset NHR, achieving a 0.91 dB improvement. Code and trained models are available at https://github.com/aoe-wyb/DDLNet.
... Arshad et al. [83] introduced PLDPNet to classify potato leaf diseases leveraging mask-based segmentation techniques. They utilised U-Net [10] for segmenting ROI on potato leaves and subsequently extracted deep features using VGG19 and Inception-V3 [20] models. Then, those extracted features were combined into a latent feature vector to work as input for a ViT-based classifier. ...
Article
Full-text available
Automated early detection and classification of paddy diseases help in applying treatment efficiently according to the detected diseases. Early detection also minimises the usage of chemical substances and pesticides and hinders the spread of the disease to healthy crops. On a broader scale, it aids in halting the global spread of diseases. Thus, it ultimately promotes healthier rice crops and increased yield. In this survey paper, we present a thorough exploration of deep learning (DL) models for the classification of paddy diseases. Our paper delves into the motivation behind this research study, reveals different paddy diseases and their associated symptoms, and unravels various deep-learning models employed for disease detection. We also discuss strategies used by researchers for improving the performance of DL models, along with adaptations tailored for application-specific contexts. Additionally, we illustrate relevant research findings, explore datasets utilised in this domain, and analyse approaches for data augmentation. Through an exhaustive investigation, we emphasise existing research gaps, challenges, and open issues, concluding in a discussion on avenues for future exploration.
... A CNN is a DL algorithm specifically designed for image and video processing, making it a popular choice for medical image analysis and diagnostics. CNNs are preferred because they are robust and easy to train [21][22][23]. ...
Article
Full-text available
Artificial intelligence (AI) is a reality of our times, and it has been successfully implemented in all fields, including medicine. As a relatively new domain, all efforts are directed towards creating algorithms applicable in most medical specialties. Pathology, as one of the most important areas of interest for precision medicine, has received significant attention in the development and implementation of AI algorithms. This focus is especially important for achieving accurate diagnoses. Moreover, immunohistochemistry (IHC) serves as a complementary diagnostic tool in pathology. It can be further augmented through the application of deep learning (DL) and machine learning (ML) algorithms for assessing and analyzing immunohistochemical markers. Such advancements can aid in delineating targeted therapeutic approaches and prognostic stratification. This article explores the applications and integration of various AI software programs and platforms used in immunohistochemical analysis. It concludes by highlighting the application of these technologies to pathologies such as breast, prostate, lung, melanocytic proliferations, and hematologic conditions. Additionally, it underscores the necessity for further innovative diagnostic algorithms to assist physicians in the diagnostic process.
... The high level of representational capacity, rapid inference speed, and the ability to share filters have established CNNs as the widely accepted and commonly used method for picture segmentation. Two often employed architectures in the field are the Fully Convolutional Networks (FCNs) [13] and the U-Net [14]. While these architectures possess robust representational abilities, they depend on multi-stage cascaded CNNs when the target organs display considerable inter-patient heterogeneity regarding shape and size. ...
Article
Full-text available
Magnetic Resonance Imaging (MRI) is a medical imaging method used to visualize the brain’s anatomy, evaluate its function, and identify any abnormalities or disorders without the need for surgical intervention. Parkinson’s Disease (PD) is a condition of gradual nervous decline that affects the neurological system and the bodily functions regulated by the nerves. The impact on Quality of Life (QOL) is significant, resulting in stigma, deterioration of cognitive function, and increased limitations in mobility, including activities of daily living. Hence, early-stage diagnosis and classification of PD is crucial. This study introduces a new Deep Neural Network architecture, designed by combining the LeNet and U-Net models (LUNet) with added attention and/or residual modules for the identification of PD. The MR images underwent pre-processing and augmentation to facilitate the precise and efficient training of Deep Learning (DL) models. The proposed model was trained using 2000 enhanced images, while validation and testing were conducted on a set of 500 samples not used in training. The final model is assessed using various statistical evaluation metrics and compared with LeNet-5, the U-Net model and its variants, and existing works. The overall accuracies of LeNet, U-Net, and the proposed model were 95.92%, 97.6%, and 99.58%, respectively.
... In order to extract the global and local information of feature maps, respectively, AMACF [39] combines the self-attention mechanism with the convolutional network. In the field of medical image segmentation, TransUNet [40] and TransBTS [41] applied Transformer and UNet [42] and produced very satisfactory segmentation results. However, their limitation is that they need a lot of training data and computational power to train and optimize the model, which means they cannot be used in environments with limited resources or volume. ...
Preprint
Full-text available
Convolutional neural networks have demonstrated efficacy in acquiring local features and spatial details; however, they struggle to obtain global information, which could potentially compromise the segmentation of important regions of an image. Transformer can increase the expressiveness of pixels by establishing global relationships between them. Moreover, some transformer-based self-attentive methods do not combine the advantages of convolution, which makes the model require more computational parameters. This work uses both Transformer and CNN structures to improve the relationship between image-level regions and global information, addressing these two issues and improving semantic segmentation accuracy and performance at the same time. We first build a Feature Alignment Module (FAM) to enhance spatial details and improve channel representations. Second, we compute the link between similar pixels using a Transformer structure, which enhances the pixel representation. Finally, we design a Pyramid Convolutional Pooling Module (PCPM) that both compresses and enriches the feature maps, as well as determines the global correlations among the pixels, to reduce the computational burden on the transformer. These three elements come together to form a transformer-based semantic segmentation feature fusion network (FFTNet). Our method yields 82.5% mIoU, according to experimental results based on the Cityscapes test dataset. Furthermore, we conducted various visualization tests using the Pascal VOC 2012 and Cityscapes datasets. The results show that our approach outperforms alternative approaches.
... UNet [57] is a popular convolutional network architecture that was introduced for biomedical image segmentation. The network contains two branches. ...
... In shadow detection methods, shadow regions/masks are identified [11,32,33]. For shadow detection, classifiers, regression, or segmentation networks using U-Net-based architectures [34] or GAN-based architectures [8] are standard techniques [33,35-37]. Zheng et al. [38] proposed a direction-aware shadow detection network by integrating spatial context modules in a hierarchical CNN. ...
Article
Full-text available
Removing shadows in images is often a necessary pre-processing task for improving the performance of computer vision applications. Deep learning shadow removal approaches require a large-scale dataset that is challenging to gather. To address the issue of limited shadow data, we present a new and cost-effective method of synthetically generating shadows using 3D virtual primitives as occluders. We simulate the shadow generation process in a virtual environment where foreground objects are composed of mapped textures from the Places-365 dataset. We argue that complex shadow regions can be approximated by mixing primitives, analogous to how 3D models in computer graphics can be represented as triangle meshes. We use the proposed synthetic shadow removal dataset, DLSUSynthPlaces-100K, to train a feature-attention-based shadow removal network without explicit domain adaptation or style transfer strategy. The results of this study show that the trained network achieves competitive results with state-of-the-art shadow removal networks that were trained purely on typical SR datasets such as ISTD or SRD. Using a synthetic shadow dataset of only triangular prisms and spheres as occluders produces the best results. Therefore, the synthetic shadow removal dataset can be a viable alternative for future deep-learning shadow removal methods. The source code and dataset can be accessed at this link: https://neildg.github.io/SynthShadowRemoval/.
... Then LDMs craft the noisy latent $z_t$ in the forward phase with $q_t(z_t \mid z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\, z_0, (1 - \bar{\alpha}_t) I)$, where the $\beta_t$, growing from 0 to 1, are pre-defined values, $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Suppose we have a textual prompt $c$ and a text encoder $\tau_\theta$ that yields the embedding $c = \tau_\theta(c)$. The goal of the LDM is to train a conditional noise estimator network $\epsilon_\theta$, e.g., a UNet [34], by predicting the Gaussian noise added in the previous timestep, to model the conditional distribution $p(z_0 \mid c)$ by gradually recovering $z_0$ from $z_T$ with the additional textual information $c$. Suppose $\epsilon_\theta(z_t, t, c)$ is the Gaussian noise estimated in the $t$-th step and $\epsilon$ is the ground-truth Gaussian noise sampled for $z_{t-1}$. ...
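The forward noising step described in this excerpt can be sketched in a few lines. This is a minimal illustration, not the cited paper's code: it assumes the standard closed-form marginal with mean $\sqrt{\bar{\alpha}_t}\, z_0$ and variance $(1 - \bar{\alpha}_t) I$, and the linear schedule, its endpoint values, and all function names are hypothetical choices made here (the excerpt only says the $\beta_t$ are pre-defined).

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule for beta_1..beta_T (an assumption, not from the excerpt)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s=1..t} (1 - beta_s) for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I); t is 1-indexed.

    Returns the noisy latent and the Gaussian noise eps that produced it.
    In training, eps is the regression target for the noise estimator.
    """
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps
```

A call such as `noise_latent(z0, 500, cumulative_alphas(linear_beta_schedule(1000)), random.Random(0))` returns one noisy latent together with the noise that a network like $\epsilon_\theta$ would be trained to predict from it.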
Preprint
Personalized diffusion models have gained popularity for adapting pre-trained text-to-image models to generate images of specific topics with only a few images. However, recent studies find that these models are vulnerable to minor adversarial perturbation, and the fine-tuning performance is largely degraded on corrupted datasets. Such characteristics are further exploited to craft protective perturbation on sensitive images like portraits that prevent unauthorized generation. In response, diffusion-based purification methods have been proposed to remove these perturbations and retain generation performance. However, existing works lack detailed analysis of the fundamental shortcut learning vulnerability of personalized diffusion models and also tend to over-purify the images, causing information loss. In this paper, we take a closer look at the fine-tuning process of personalized diffusion models through the lens of shortcut learning and propose a hypothesis that could explain the underlying manipulation mechanisms of existing perturbation methods. Specifically, we find that the perturbed images are greatly shifted from their original paired prompt in the CLIP-based latent space. As a result, training with this mismatched image-prompt pair creates a construction that causes the models to dump their out-of-distribution noisy patterns to the identifier, thus causing serious performance degradation. Based on this observation, we propose a systematic approach to retain the training performance with purification that realigns the latent image and its semantic meaning and also introduces contrastive learning with a negative token to decouple the learning of the wanted clean identity from the unwanted noisy pattern, which shows strong potential capacity against further adaptive perturbation.
... We added a small front-end adaptation model that fills the gaps in the input spectrum before passing it on to the backend ASR model. We used a U-net [13] architecture with skip connections from popular PLC and inpainting models. However, since our goal is to improve ASR metrics, specifically Word Error Rate (WER) and not the audio quality, instead of using perceptual losses, we utilize the gradients from the ASR model to update the adaptation model's weights. ...
Preprint
Full-text available
In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal effect on the Whisper model's foundational performance, underscoring our method's practicality and potential in enhancing ASR models in challenging acoustic environments.
... This relies on segmenting a vision-based input, such as an RGB camera, highlighting how various objects are distributed in an image. Most modern computer vision methods use convolutional neural networks (CNNs) to solve vision segmentation tasks ([2-7]). Due to their versatility, CNNs are also the most performant models for many other vision tasks, including image recognition ([8-10]), image generation ([11-14]), and scene rendering ([15-17]), among others. ...
Preprint
Full-text available
Human-robot collaboration requires the establishment of methods to guarantee the safety of participating operators. A necessary part of this process is ensuring reliable human pose estimation. Established vision-based modalities encounter problems when under conditions of occlusion. This article describes the combination of two perception modalities for pose estimation in environments containing such transient occlusion. We first introduce a vision-based pose estimation method, based on a deep Predictive Coding (PC) model featuring robustness to partial occlusion. Next, capacitive sensing hardware capable of detecting various objects is introduced. The sensor is compact enough to be mounted on the exterior of any given robotic system. The technology is particularly well-suited to detection of capacitive material, such as living tissue. Pose estimation from the two individual sensing modalities is combined using a modified Luenberger observer model. We demonstrate that the results offer better performance than either sensor alone. The efficacy of the system is demonstrated on an environment containing a robot arm and a human, showing the ability to estimate the pose of a human forearm under varying levels of occlusion.
... The Stable Diffusion model operates the diffusion and denoising process in latent space rather than pixel space to reduce computation cost. It adopts a UNet-like [34] structure as its backbone, comprising downsampling blocks, a middle block, and upsampling blocks. The text guidance is encoded through the CLIP [30] text encoder and integrated into the UNet through a CrossAttention block after each ResBlock [10]. ...
Preprint
Full-text available
The field of text-to-image (T2I) generation has made significant progress in recent years, largely driven by advancements in diffusion models. Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. This challenge has been explored, to a great extent, by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models through extra encoding. However, multi-control image synthesis still faces several challenges. Specifically, current approaches are limited in handling free combinations of diverse input control signals, overlook the complex relationships among multiple spatial conditions, and often fail to maintain semantic alignment with provided textual prompts. This can lead to suboptimal user experiences. To address these challenges, we propose AnyControl, a multi-control image synthesis framework that supports arbitrary combinations of diverse control signals. AnyControl develops a novel Multi-Control Encoder that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals, as demonstrated by extensive quantitative and qualitative evaluations. Our project page is available at https://any-control.github.io.
... Clearly, no local operation, for example, a convolution of, say, 3 × 3 or even 7 × 7, can be used to move the information from the bottom left of the image to the top right. Therefore, the architecture to achieve this task requires either many convolution layers or downsampling the image via pooling, where the operations are local, performing convolutions on the downsampled image, and then upsampling the image via unpooling, followed by additional convolutions to "clean" coarsening and interpolation artifacts, as is typical in UNets [39,8]. To demonstrate, we attempt to fit the data with a simple convolution residual network and with a residual network that has an advection block, as discussed in this paper. ...
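The trade-off between stacking many local convolutions and downsampling first can be made concrete with the usual receptive-field recurrence. The sketch below is illustrative only: the layer stacks are hypothetical examples chosen here, not the architecture of the cited work, and it models only the contracting (conv plus pooling) half of a UNet-style network.

```python
def receptive_field(layers):
    """Width, in input pixels, seen by one output unit of a layer stack.

    layers is a list of (kernel_size, stride) pairs applied in order.
    Each layer widens the field by (kernel_size - 1) times the current
    spacing between neighbouring units, and its stride scales that spacing.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Four plain 3x3 convolutions with stride 1.
plain = [(3, 1)] * 4

# The same four 3x3 convolutions with a 2x2 stride-2 pooling between each pair.
pooled = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]

print(receptive_field(plain))   # 9
print(receptive_field(pooled))  # 38
```

With the same four convolutions, interleaved pooling widens the field from 9 to 38 pixels; a stride-1 stack of 3 × 3 convolutions would need 19 layers to reach at least that width, which is the cost the excerpt alludes to.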
Preprint
Many problems in physical sciences are characterized by the prediction of space-time sequences. Such problems range from weather prediction to the analysis of disease propagation and video prediction. Modern techniques for the solution of these problems typically combine a Convolutional Neural Network (CNN) architecture with a time prediction mechanism. However, oftentimes, such approaches underperform in the long-range propagation of information and lack explainability. In this work, we introduce a physically inspired architecture for the solution of such problems. Namely, we propose to augment CNNs with advection by designing a novel semi-Lagrangian push operator. We show that the proposed operator allows for the non-local transformation of information compared with standard convolutional kernels. We then complement it with Reaction and Diffusion neural components to form a network that mimics the Reaction-Advection-Diffusion equation, in high dimensions. We demonstrate the effectiveness of our network on a number of spatio-temporal datasets that show its merit.
... PK-Diffusion. For conditional diffusion models [18,40], we designed two types based on U-Net [41] and Transformer [40] architectures. In the U-Net model (PK-DIFF), the encoder extracts features by stacking multiple layers of convolutions, while the decoder reconstructs features by stacking multiple layers of convolutions. ...
Preprint
Full-text available
Data-driven generative models have emerged as promising approaches towards achieving efficient mechanical inverse design. However, due to the prohibitively high cost in time and money, there is still a lack of open-source and large-scale benchmarks in this field. This is especially the case for airfoil inverse design, which requires generating and editing diverse geometric-qualified and aerodynamic-qualified airfoils following multimodal instructions, i.e., dragging points and physical parameters. This paper presents the open-source endeavors in airfoil inverse design, AFBench, including a large-scale dataset with 200 thousand airfoils and high-quality aerodynamic and geometric labels, two novel and practical airfoil inverse design tasks, i.e., conditional generation on multimodal physical parameters and controllable editing, and comprehensive metrics to evaluate various existing airfoil inverse design methods. Our aim is to establish AFBench as an ecosystem for training and evaluating airfoil inverse design methods, with a specific focus on data-driven controllable inverse design models driven by multimodal instructions, capable of bridging the gap between ideas and execution and between academic research and industrial applications. We have provided baseline models, comprehensive experimental observations, and analysis to accelerate future research. Our baseline model is trained on an RTX 3090 GPU within 16 hours. The codebase, datasets and benchmarks will be available at https://hitcslj.github.io/afbench/.
... A forward Markov chain is defined to perturb samples drawn from the data distribution, x 0 ∼ q(x 0 ), towards a limiting isotropic Gaussian distribution. The conditional mean of the reverse-time process was derived [32,18] as the training target of a U-Net [28], which is used to propagate samples drawn from the limiting distribution back to the data distribution after training, generating novel samples. In short, the reverse diffusion process progressively transforms noise into coherent structures, while the forward diffusion process destroys the structure in the data distribution incrementally. ...
Preprint
Full-text available
This study introduces a hybrid fluid simulation approach that integrates generative diffusion models with physics-based simulations, aiming at reducing the computational costs of flow simulations while still honoring all the physical properties of interest. These simulations enhance our understanding of applications such as assessing hydrogen and CO$_2$ storage efficiency in underground reservoirs. Nevertheless, they are computationally expensive and the presence of nonunique solutions can require multiple simulations within a single geometry. To overcome the computational cost hurdle, we propose a hybrid method that couples generative diffusion models and physics-based modeling. We introduce a system to condition the diffusion model with a geometry of interest, allowing it to produce variable fluid saturations in the same geometry. While training the model, we simultaneously generate initial conditions and perform physics-based simulations using these conditions. This integrated approach enables us to receive real-time feedback on a single compute node equipped with both CPUs and GPUs. By efficiently managing these processes within one compute node, we can continuously evaluate performance and stop training when the desired criteria are met. To test our model, we generate realizations in a real Berea sandstone fracture, which show that our technique is up to 4.4 times faster than commonly used flow simulation initializations.
... Candidate neurons to prune in DMs: Image denoisers in popular LDMs, such as Stable Diffusion, are characterized by the use of UNets [Ronneberger et al., 2015]. UNets consist of ResNet blocks that downsample or upsample the denoised latent space representations and transformer blocks that consist of self-attention between latent space, cross attention to incorporate textual guidance, and a Feed-forward network (FFN) with GEGLU activation function [Shazeer, 2020]. ...
Preprint
Large-scale text-to-image diffusion models excel in generating high-quality images from textual inputs, yet concerns arise as research indicates their tendency to memorize and replicate training data, raising copyright infringement and privacy issues. Efforts within the text-to-image community to address memorization explore causes such as data duplication, replicated captions, or trigger tokens, proposing per-prompt inference-time or training-time mitigation strategies. In this paper, we focus on the feed-forward layers and begin by contrasting neuron activations of a set of memorized and non-memorized prompts. Experiments reveal a surprising finding: many different sets of memorized prompts significantly activate a common subspace in the model, demonstrating, for the first time, that memorization in the diffusion models lies in a special subspace. Subsequently, we introduce a novel post-hoc method for editing pre-trained models, whereby memorization is mitigated through the straightforward pruning of weights in specialized subspaces, avoiding the need to disrupt the training or inference process as seen in prior research. Finally, we demonstrate the robustness of the pruned model against training data extraction attacks, thereby unveiling new avenues for a practical and one-for-all solution to memorization.
... An encoder-decoder U-Net is chosen for est due to a U-Net's desirable capabilities for image segmentation and distribution learning [17]. The architecture of est is directly adapted from a U-Net architecture used for single image material relighting by Bieron et al. [1]. ...
Preprint
We propose a material appearance modeling neural network for visualizing plausible, spatially-varying materials under diverse view and lighting conditions, utilizing only a single photograph of a material under co-located light and view as input for appearance estimation. Our neural architecture is composed of two network stages: a network that infers learned per-pixel neural parameters of a material from a single input photograph, and a network that renders the material utilizing these neural parameters, similar to a BRDF. We train our model on a set of 312,165 synthetic spatially-varying exemplars. Since our method infers learned neural parameters rather than analytical BRDF parameters, our method is capable of encoding anisotropic and global illumination (inter-pixel interaction) information into individual pixel parameters. We demonstrate our model's performance compared to prior work and demonstrate the feasibility of the render network as a BRDF by implementing it into the Mitsuba3 rendering engine. Finally, we briefly discuss the capability of neural parameters to encode global illumination information.
... Latent diffusion models generate data examples by beginning with a latent vector of noise and iteratively denoising it using a UNet [25] into a latent vector that can be decoded into a data example. ...
Preprint
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
... Second, scaling up the model size gives noticeable improvements, especially in audio quality whether the model was trained on the full dataset or only AC. Finally, we compare our small model against a U-Net [93] baseline at a similar computational complexity. We notice significant improvements across all metrics enabled by our scalable transformer architecture. ...
Preprint
Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.
... Spatial discretization. When the meshes supporting the numerical simulations are regular, they can be seen as images and convolutional networks can be employed successfully (Zhu et al., 2019;Ronneberger et al., 2015;Kasim et al., 2021). For instance, U-Net architectures are employed in (Thuerey et al., 2020;Wang et al., 2020), or auto-encoders in (Kim et al., 2019). ...
Preprint
A surrogate model approximates the outputs of a solver of Partial Differential Equations (PDEs) with a low computational cost. In this article, we propose a method to build learning-based surrogates in the context of parameterized PDEs, which are PDEs that depend on a set of parameters but are also temporal and spatial processes. Our contribution is a method hybridizing the Proper Orthogonal Decomposition and several Support Vector Regression machines. This method is conceived to work in real time and is thus aimed at digital twins, where a user can perform an interactive analysis of results based on the proposed surrogate. We present promising results on two use cases concerning electrical machines. These use cases are not toy examples: they are produced by an industrial computational code, use meshes representing non-trivial geometries, and contain non-linearities.
... One way is to utilize specially designed convolutional architectures to bridge the gap from 2D to 3D and learn 3D occupancy representations. A prominent example of this approach is adopting the U-Net [70] architecture as the carrier for feature bridging. The U-Net architecture employs an encoder-decoder structure with skip connections between the upsampling and downsampling paths, preserving both low-level and high-level feature information to alleviate information loss. ...
Preprint
Full-text available
In recent years, autonomous driving has garnered escalating attention for its potential to relieve drivers' burdens and improve driving safety. Vision-based 3D occupancy prediction, which predicts the spatial occupancy status and semantics of 3D voxel grids around the autonomous vehicle from image inputs, is an emerging perception task suitable for cost-effective perception systems in autonomous driving. Although numerous studies have demonstrated the greater advantages of 3D occupancy prediction over object-centric perception tasks, there is still a lack of a dedicated review focusing on this rapidly developing field. In this paper, we first introduce the background of vision-based 3D occupancy prediction and discuss the challenges in this task. Secondly, we conduct a comprehensive survey of the progress in vision-based 3D occupancy prediction from three aspects: feature enhancement, deployment friendliness and label efficiency, and provide an in-depth analysis of the potentials and challenges of each category of methods. Finally, we present a summary of prevailing research trends and propose some inspiring future outlooks. To provide a valuable reference for researchers, a regularly updated collection of related papers, datasets, and codes is organized at https://github.com/zya3d/Awesome-3D-Occupancy-Prediction.
... Deep-learning-based pixelwise segmentation methods such as U-Net and Mask Region-based Convolutional Neural Network (R-CNN) achieve state-of-the-art performance, but they require a large amount of data to annotate and train in heavy computation. (17,18) For the LED region segmentation problem, datasets are not publicly available and it is extremely difficult to collect and annotate data to train deep learning models because of the wide variety of patterns in LED lighting. In this case, traditional image segmentation methods are more realistic for application to the problem. ...
... The proposed method is compared with the existing PSA-EDUNet, EddyNet, U-Net [49] and DeepLabV3+ [50] technologies. The effectiveness of the proposed model is fully Fig.4(b). ...
Article
Full-text available
Ocean eddies have a significant impact on marine ecosystems and the climate because they transport essential substances in the ocean. Detection of ocean eddies has become one of the most active topics in physical ocean research. In recent years, research based on deep learning has mainly focused on regional oceans, with small and specific data and relatively general detection results. This study processes global eddies by pixel-by-pixel classification and generates a global eddy classification map with a resolution of 720×1440, which expands the data volume and improves the generality of the data. Moreover, a high-precision attention residual U$^{2}$-Net model, referred to as ARU$^{2}$-Net, is proposed, which is suitable for mining eddy surface features from sea level anomaly (SLA) and sea surface temperature (SST) data in the global ocean. ARU$^{2}$-Net integrates the Convolutional Block Attention Module (CBAM). The channel attention of the CBAM module is used to learn the correlation features between the SST and SLA dual channels. The spatial attention mechanism of the CBAM module is used to learn the importance of the spatial location of the eddy, focusing on the locally important regions, which further improves the detection ability of ARU$^{2}$-Net for eddies and helps ARU$^{2}$-Net to better identify the eddy categories. Finally, we demonstrate the effectiveness of our approach on the global eddy dataset, achieving a test performance of 94.926%, significantly exceeding previous detection in some areas.
... proposed U-Net [4], an encoder-decoder semantic segmentation model for medical images which uses encoder to obtain spatial information and image semantics through down-sampling while decoder restores the resolution of feature maps through up-sampling and extracts image details through cross-layer fusion of feature maps, which performs well in the medical field. In 2016, Deeplab V2 [5] model was proposed based on Deeplab V1 [6] network. ...
Preprint
Full-text available
Evaluating damage to aircraft brake moving discs based on personal experience suffers from several problems, such as low detection efficiency and strong subjectivity. This paper proposes a new evaluation method based on image segmentation. This study reviews the background and literature of related research and analyzes the various categories and features of brake moving discs. The image data set of the damage area has been semantically segmented based on the U-Net model, and a model for quantitative analysis of the rust area and a calculation model for the maximum radial width of the shedding area have also been constructed. Finally, image data of the brake moving disc on a Cessna 525 has been used to validate these models. The results show that: (1) the image segmentation of different categories of brake moving discs based on the U-Net model works well, and the average accuracy, average recall rate, average pixel accuracy and average Intersection over Union are 91%, 91%, 90% and 87% respectively; (2) the evaluation result for damage of the brake moving disc is consistent with the empirical judgement from mechanics, thus verifying that the proposed algorithm is reasonable and feasible. The research results provide a reference basis for mechanical engineers to scientifically and quantitatively determine the degree of damage to aircraft brake moving discs and the need for replacement.
... Figure 4 shows (a) the CT-L3 image; (b) the pixels corresponding to the superpixels classified as SMT by the RF; (c) the coarse ground truth. U-Net is a CNN architecture proposed in [Ronneberger et al. 2015] for the specific task of medical image segmentation. The authors claim that U-Net may be trained with a relatively small set of images and still performs well. ...
Conference Paper
Estimates of the composition of skeletal muscle tissue (SMT) and adipose tissues are important in the treatment of debilitating diseases, such as cancer, and in the control of overweight and obesity. Several studies have shown a high correlation between the percentage of SMT in computed tomography (CT) images corresponding to the cross-section at the level of the third lumbar vertebra (L3) and the percentage of this tissue in the whole body. A large number of models have been proposed to automatically segment CT images in order to estimate tissue compositions. Many of them use supervised Machine Learning (ML) methods, such as neural networks, which require large amounts of labeled images, i.e., images and ground truth masks obtained from manual segmentation by human experts. These large labeled datasets are not easily available to the public, thus the present work proposes a methodology capable of performing the automatic segmentation of SMT in single-slice CT images (at L3) using only “coarse” segmentation masks as ground truth in the ML algorithms' training phases. By “coarse segmentation” we mean a semiautomated segmentation performed by a person without specialized knowledge of human anatomy. The proposed methodology oversegments the image into superpixels, which are classified by a Random Forest (RF) model. Then, a U-Net CNN refines the classification, using as input the pixels in the superpixel segments classified as SMT by the RF. The methodology achieved 99.21% of the accuracy obtained by the same CNN trained with gold-standard ground truth masks, i.e., segmentation masks manually created by a medical expert.
Article
Full-text available
Context. Designing a new architecture is a difficult and time-consuming process that in some cases can be replaced by scaling an existing model. In this paper we examine convolutional neural network scaling methods and aim to develop a method that scales an original network solving a segmentation task into a more accurate network. Objective. The goal of the work is to develop a method of scaling a convolutional neural network that matches or outperforms existing scaling methods, and to verify its effectiveness in solving a semantic segmentation task. Method. The proposed asymmetric method combines the advantages of other methods; it yields a network as accurate as the combined method and can even outperform other methods. The method is designed to be applicable to convolutional neural networks that follow an encoder-decoder architecture designed for semantic segmentation. It enhances the feature extraction potential of the encoder while preserving the decoder part of the architecture. Because of its asymmetric nature, the proposed method is more efficient, since it results in a smaller increase in the number of parameters. Results. The proposed method was implemented on the U-Net architecture applied to a semantic segmentation task. The method and the other methods were evaluated on the semantic dataset. The asymmetric scaling method matched or outperformed the results of other scaling methods while having fewer parameters. Conclusions. Scaling techniques can be beneficial when extra computational resources are available. The proposed method was evaluated on a semantic segmentation task, on which it showed its efficiency. Even though scaling methods improve the accuracy of the original network, they substantially increase its resource requirements, which the proposed asymmetric method is intended to reduce. The prospects for further research may include optimization of the process and investigation of the tradeoff between accuracy gain and resource requirements, as well as experiments that include several different architectures.
Article
Full-text available
The efficacy of an implantable cardioverter-defibrillator (ICD) in patients with a non-ischaemic cardiomyopathy for primary prevention of sudden cardiac death is increasingly debated. We developed a multimodal deep learning model for arrhythmic risk prediction that integrated late gadolinium enhanced (LGE) cardiac magnetic resonance imaging (MRI), electrocardiography (ECG) and clinical data. Short-axis LGE-MRI scans and 12-lead ECGs were retrospectively collected from a cohort of 289 patients prior to ICD implantation, across two tertiary hospitals. A residual variational autoencoder was developed to extract physiological features from LGE-MRI and ECG, and used as inputs for a machine learning model (DEEP RISK) to predict malignant ventricular arrhythmia onset. In the validation cohort, the multimodal DEEP RISK model predicted malignant ventricular arrhythmias with an area under the receiver operating characteristic curve (AUROC) of 0.84 (95% confidence interval (CI) 0.71–0.96), a sensitivity of 0.98 (95% CI 0.75–1.00) and a specificity of 0.73 (95% CI 0.58–0.97). The models trained on individual modalities exhibited lower AUROC values compared to DEEP RISK [MRI branch: 0.80 (95% CI 0.65–0.94), ECG branch: 0.54 (95% CI 0.26–0.82), Clinical branch: 0.64 (95% CI 0.39–0.87)]. These results suggest that a multimodal model achieves high prognostic accuracy in predicting ventricular arrhythmias in a cohort of patients with non-ischaemic systolic heart failure, using data collected prior to ICD implantation.
Article
Full-text available
Currently, remote sensing techniques assist in various environmental applications and facilitate observation and spatial analysis. Machine learning algorithms allow researchers to find dependencies in satellite data and vegetation cover properties. One of the significant tasks for ecological assessment is associated with estimating forest characteristics and monitoring changes over time. In contrast to the general computer vision domain, remote sensing data and forestry measurements have their own specific requirements and necessitate tailored approaches that involve processing multispectral satellite data, creating feature spaces, and selecting training samples. In this study, we focus on extracting primary forest characteristics, including forest species groups, height, basal area, and timber stock. We utilise Sentinel-2 multispectral data to develop a machine learning-based solution for vast and remote territories. Timber stock is calculated using empirical formulas based on measurements of forest species groups, height, and basal area. These intermediate forest parameters are estimated using individually trained machine learning algorithms for each parameter. As a case study, we examine the Sakhalin region (Russia), which encompasses several forestries with varying vegetation properties. In Nevelskoye forestry, we achieved a mean absolute error (MAE) of 1.6 m for height, 0.084 for basal area, and 47.8 m3/ha for timber stock. The results obtained demonstrate promise for further integrating artificial intelligence-based solutions into forestry decision-making processes and natural resources management.
Preprint
Full-text available
Cross-modality transfer aims to leverage large pretrained models to complete tasks that may not belong to the modality of pretraining data. Existing works achieve certain success in extending classical finetuning to cross-modal scenarios, yet we still lack understanding about the influence of the modality gap on the transfer. In this work, a series of experiments focusing on the source representation quality during transfer are conducted, revealing the connection between a larger modality gap and lesser knowledge reuse, which means ineffective transfer. We then formalize the gap as the knowledge misalignment between modalities using the conditional distribution P(Y|X). Towards this problem, we present Modality kNowledge Alignment (MoNA), a meta-learning approach that learns target data transformation to reduce the modality knowledge discrepancy ahead of the transfer. Experiments show that our method enables better reuse of source modality knowledge in cross-modality transfer, which leads to improvements upon existing finetuning methods.
Chapter
The task of automating the monitoring of items allowed and prohibited to be brought on board an airplane is considered. Our own sample of 18,000 X-ray images was prepared in cooperation with the Ulyanovsk Civil Aviation Institute to solve this task. Classification, object detection and segmentation algorithms are investigated using modern deep learning technologies and our dataset. Modifications of convolutional neural networks as well as attention-based networks or transformer-based computer vision architectures are proposed. Special attention is given to methods for optimizing models to enable real-time performance. High accuracy, mean Average Precision (mAP, 84.5%) and Dice score (87.22%) quality metrics are obtained, along with a frames-per-second (FPS) rate above 20. Suggestions for the application of the obtained results and directions for further research are formulated.
Article
Background: Quantitative maps obtained with diffusion weighted (DW) imaging, such as fractional anisotropy (FA), calculated by fitting the diffusion tensor (DT) model to the data, are very useful to study neurological diseases. To fit this map accurately, acquisition times of the order of several minutes are needed because many noncollinear DW volumes must be acquired to reduce directional biases. Deep learning (DL) can be used to reduce acquisition times by reducing the number of DW volumes. We already developed a DL network named “one-minute FA,” which uses 10 DW volumes to obtain FA maps, maintaining the same characteristics and clinical sensitivity of the FA maps calculated with the standard method using more volumes. Recent publications have indicated that it is possible to train DL networks and obtain FA maps even with 4 DW input volumes, far less than the minimum number of directions for the mathematical estimation of the DT. Methods: Here we investigated the impact of reducing the number of DW input volumes to 4 or 7, and evaluated the performance and clinical sensitivity of the corresponding DL networks trained to calculate FA, while also comparing results with those using our one-minute FA. Each network was trained on the human connectome project (HCP) open-access dataset, which has a high resolution and many DW volumes, used to fit a ground truth FA. To evaluate the generalizability of each network, they were tested on two external clinical datasets, not seen during training, and acquired on different scanners with different protocols, as previously done. Results: Using 4 or 7 DW volumes, it was possible to train DL networks to obtain FA maps with the same range of values as the ground-truth map only when using HCP test data; pathological sensitivity was lost when the networks were tested on the external clinical datasets: in both cases, no consistent differences were found between patient groups. In contrast, our “one-minute FA” did not suffer from the same problem. Conclusion: When developing DL networks for reduced acquisition times, the ability to generalize and to generate quantitative biomarkers that provide clinical sensitivity must be addressed.
Article
Being an image-based optical technique for full-field deformation measurements, the ultimate purpose of digital image correlation (DIC) is to realize accurate, precise and pixel-wise displacement/strain measurements in a fully automatic manner without users’ inputs. In this work, we propose a task-optimized neural network, called RAFT-DIC, to achieve user-independent, accurate and pixel-wise displacement field measurements. RAFT-DIC is based on the state-of-the-art optical flow architecture: Recurrent All-Pairs Field Transforms (RAFT). We make two targeted improvements that fundamentally enhance its measurement accuracy and generalization performance. Firstly, we remove all the down-sampling operations in the encode module to improve the perception of spatial information, and reduce the number of pyramid levels of the correlation layer to increase the small displacement accuracy. By building the correlation layer to compute the similarity of pixel pairs, and iteratively updating the displacement field through a recurrent unit, RAFT-DIC introduces the prior information of DIC measurement to guide the displacement estimation with high accuracy. Secondly, we develop a novel dataset generation method to synthesize customized speckle patterns and diverse displacement fields, which facilitates the construction of a robust and adaptable dataset to improve the network generalization. Both simulated and real experimental results demonstrate that the accuracy of the proposed method is approximately an order of magnitude higher than previous deep learning-based DIC (DL-DIC). The proposed RAFT-DIC shows higher accuracy as well as stronger practicality and cross-dataset generalization performance over existing DL-DIC methods, and is expected to be a new standard architecture for DL-DIC.
Article
The reconstruction of object images that are located in 3D scene cross-sections using digital holography is described. The potential of generative adversarial networks for reconstructing cross-sections of 3D scenes composed of multiple layers of off-axis objects from holograms is investigated. Such scenes consist of a series of sections with objects that are not aligned with the camera’s axis. Digital holograms were used to reconstruct images of cross-sectional views of 3D scenes. It has been shown that the use of neural networks increases the speed and reconstruction quality, and reduces the image noise. A method for reconstructing images of objects using digital off-axis holograms and a generative adversarial neural network is proposed. The proposed method was tested on both numerically simulated and experimentally captured digital holograms. It was able to successfully reconstruct up to 8 cross-sections of a 3D scene from a single hologram. The average structural similarity index measure was at least 0.73. For optically registered holograms, the method reconstructed object image cross-sections of a 3D scene with a structural similarity index measure of 0.83 over the cross-sections. Therefore, the proposed technique provides the possibility for high-quality object image reconstruction and could be utilized in the analysis of micro- and macroobjects, including medical and biological applications, metrology, characterization of materials, surfaces, and volume media.
Article
Full-text available
Recognition algorithms based on convolutional networks (CNNs) typically use the output of the last layer as feature representation. However, the information in this layer may be too coarse to allow precise localization. On the contrary, earlier layers may be precise in localization but will not capture semantics. To get the best of both worlds, we define the hypercolumn at a pixel as the vector of activations of all CNN units above that pixel. Using hypercolumns as pixel descriptors, we show results on three fine-grained localization tasks: simultaneous detection and segmentation [21], where we improve state-of-the-art from 49.7 [21] mean AP^r to 59.0, keypoint localization, where we get a 3.3 point boost over [19], and part labeling, where we show a 6.6 point gain over a strong baseline.
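The hypercolumn definition in this abstract can be illustrated with a minimal sketch. It assumes square feature maps stored as nested Python lists and uses a nearest-neighbour lookup to find the unit "above" a pixel; the function name `hypercolumn`, the `image_size` parameter, and the nearest-neighbour choice are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence

# One layer's activations, indexed as [channel][row][col] (hypothetical layout).
FeatureMap = List[List[List[float]]]


def hypercolumn(feature_maps: Sequence[FeatureMap], y: int, x: int,
                image_size: int) -> List[float]:
    """Concatenate the activations located above pixel (y, x) in every layer.

    Assumes square feature maps and a square input of side `image_size`.
    """
    column: List[float] = []
    for fmap in feature_maps:
        side = len(fmap[0])  # spatial side length of this layer
        # Nearest-neighbour mapping from image coordinates to this layer's grid
        # (a simplification chosen for this sketch).
        fy = min(y * side // image_size, side - 1)
        fx = min(x * side // image_size, side - 1)
        column.extend(channel[fy][fx] for channel in fmap)
    return column


# A fine 4x4 layer with one channel and a coarse 2x2 layer with two channels.
fine = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
coarse = [[[10.0, 20.0], [30.0, 40.0]], [[-1.0, -2.0], [-3.0, -4.0]]]
descriptor = hypercolumn([fine, coarse], y=3, x=1, image_size=4)
```

The resulting descriptor has one entry per channel across all layers, so a per-pixel classifier can see both fine localization cues and coarse semantic cues at once.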
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Conference Paper
Full-text available
Contextual information plays an important role in solving vision problems such as image segmentation. However, extracting contextual information and using it in an effective way remains a difficult problem. To address this challenge, we propose a multi-resolution contextual framework, called cascaded hierarchical model (CHM), which learns contextual information in a hierarchical framework for image segmentation. At each level of the hierarchy, a classifier is trained based on downsampled input images and outputs of previous levels. Our model then incorporates the resulting multi-resolution contextual information into a classifier to segment the input image at original resolution. We repeat this procedure by cascading the hierarchical framework to improve the segmentation accuracy. Multiple classifiers are learned in the CHM, therefore, a fast and accurate classifier is required to make the training tractable. The classifier also needs to be robust against overfitting due to the large number of parameters learned during training. We introduce a novel classification scheme, called logistic disjunctive normal networks (LDNN), which consists of one adaptive layer of feature detectors implemented by logistic sigmoid functions followed by two fixed layers of logical units that compute conjunctions and disjunctions, respectively. We demonstrate that LDNN outperforms state-of-the-art classifiers and can be used in the CHM to improve object segmentation performance.
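The three-layer LDNN structure described in this abstract can be sketched as follows. The sketch assumes that the fixed logical layers are realized with differentiable stand-ins: a product for each conjunction and the De Morgan form (one minus the product of complements) for the disjunction. The function names and the `groups` data layout are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

# One sigmoid feature detector: (weight vector, bias). Hypothetical layout.
Detector = Tuple[Sequence[float], float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ldnn_forward(x: Sequence[float], groups: List[List[Detector]]) -> float:
    """Soft disjunction over groups of soft conjunctions of sigmoid detectors."""
    prob_no_group_fires = 1.0
    for group in groups:
        # Conjunction layer: product of the sigmoid outputs within one group.
        conjunction = 1.0
        for weights, bias in group:
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            conjunction *= sigmoid(z)
        # Disjunction layer via De Morgan: OR(g) = 1 - prod(1 - g).
        prob_no_group_fires *= 1.0 - conjunction
    return 1.0 - prob_no_group_fires


# Two groups of two detectors each; zero weights make every sigmoid output 0.5.
zero_detector: Detector = ([0.0, 0.0], 0.0)
groups = [[zero_detector, zero_detector], [zero_detector, zero_detector]]
score = ldnn_forward([1.0, -1.0], groups)
```

Only the detector weights and biases are adaptive in this sketch; the conjunction and disjunction steps carry no parameters, matching the "one adaptive layer followed by two fixed layers" description.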
Article
Full-text available
Motivation: Automatic tracking of cells in multidimensional timelapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this paper, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately. Availability and implementation: The challenge website (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge. Contact: codesolorzano@unav.es Supplementary information: Supplementary data, including video samples and algorithm descriptions are available at Bioinformatics online.
Article
Full-text available
The analysis of microcircuitry (the connectivity at the level of individual neuronal processes and synapses), which is indispensable for our understanding of brain function, is based on serial transmission electron microscopy (TEM) or one of its modern variants. Due to technical limitations, most previous studies that used serial TEM recorded relatively small stacks of individual neurons. As a result, our knowledge of microcircuitry in any nervous system is very limited. We applied the software package TrakEM2 to reconstruct neuronal microcircuitry from TEM sections of a small brain, the early larval brain of Drosophila melanogaster. TrakEM2 enables us to embed the analysis of the TEM image volumes at the microcircuit level into a light microscopically derived neuro-anatomical framework, by registering confocal stacks containing sparsely labeled neural structures with the TEM image volume. We imaged two sets of serial TEM sections of the Drosophila first instar larval brain neuropile and one ventral nerve cord segment, and here report our first results pertaining to Drosophila brain microcircuitry. Terminal neurites fall into a small number of generic classes termed globular, varicose, axiform, and dendritiform. Globular and varicose neurites have large diameter segments that carry almost exclusively presynaptic sites. Dendritiform neurites are thin, highly branched processes that are almost exclusively postsynaptic. Due to the high branching density of dendritiform fibers and the fact that synapses are polyadic, neurites are highly interconnected even within small neuropile volumes. We describe the network motifs most frequently encountered in the Drosophila neuropile. Our study introduces an approach towards a comprehensive anatomical reconstruction of neuronal microcircuitry and delivers microcircuitry comparisons between vertebrate and insect neuropile.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
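The two ideas in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `prelu`, `he_std`, and `he_init` are hypothetical, the default slope of 0.25 is an assumed starting value, and the slope is held fixed here, whereas in PReLU it is a learned parameter. The initialization draws zero-mean Gaussian weights whose variance is scaled by the fan-in, with a correction for the negative slope.

```python
import math
import random


def prelu(x, a=0.25):
    """Parametric ReLU: identity for positive inputs, slope `a` otherwise.

    `a` is a learned parameter in PReLU; it is a fixed argument here (assumption).
    """
    return x if x > 0 else a * x


def he_std(fan_in, a=0.0):
    """Standard deviation for rectifier-aware initialization.

    Uses std = sqrt(2 / ((1 + a^2) * fan_in)); a = 0 gives the plain ReLU case.
    """
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))


def he_init(fan_in, fan_out, a=0.0, seed=0):
    """Return a fan_out x fan_in weight matrix of zero-mean Gaussian samples."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    std = he_std(fan_in, a)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


if __name__ == "__main__":
    print(prelu(2.0), prelu(-2.0))   # positive passes through, negative is scaled
    print(he_std(8))                 # sqrt(2 / 8)
    weights = he_init(fan_in=8, fan_out=4)
    print(len(weights), len(weights[0]))
```

Scaling the variance by the fan-in keeps the magnitude of activations roughly constant from layer to layer, which is what lets very deep rectified networks be trained from scratch.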
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Article
We address a central problem of neuroanatomy, namely, the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. This is necessary to efficiently map 3D brain structure and connectivity. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or non-membrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific post-processing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer.
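The sliding-window scheme described here, in which each pixel is labeled from a square window centered on it, can be sketched as follows. This is a rough illustration under stated assumptions rather than the method of the cited work: the helper names `reflect_index`, `window_at`, and `classify_pixels` are hypothetical, borders are handled by mirroring (an assumption), and `classifier` stands in for the trained network as any function that maps a window to a label.

```python
def reflect_index(i, n):
    """Map any integer index into [0, n) by mirroring at the borders."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period  # Python's modulo is non-negative for a positive period
    return i if i < n else period - i


def window_at(image, row, col, size):
    """Return the size x size window centered on (row, col); size must be odd."""
    half = size // 2
    height, width = len(image), len(image[0])
    return [
        [
            image[reflect_index(row + dr, height)][reflect_index(col + dc, width)]
            for dc in range(-half, half + 1)
        ]
        for dr in range(-half, half + 1)
    ]


def classify_pixels(image, size, classifier):
    """Label every pixel by applying `classifier` to its surrounding window."""
    return [
        [classifier(window_at(image, r, c, size)) for c in range(len(image[0]))]
        for r in range(len(image))
    ]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Toy stand-in for the network: threshold the center pixel of the window.
    labels = classify_pixels(image, 3, lambda w: w[1][1] > 4)
    for row in labels:
        print(row)
```

Because every pixel triggers a separate evaluation on a heavily overlapping window, this approach repeats a great deal of computation, which is the inefficiency that fully convolutional designs avoid.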
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
Discriminative unsupervised feature learning with convolutional neural networks
  • A Dosovitskiy
  • J T Springenberg
  • M Riedmiller
  • T Brox
Deep neural networks segment neuronal membranes in electron microscopy images
  • D C Ciresan
  • L M Gambardella
  • A Giusti
  • J Schmidhuber