Article
PDF Available

Robot Vision Architecture for Autonomous Clothes Manipulation


Abstract

This paper presents a novel robot vision architecture for perceiving generic 3D clothes configurations. Our architecture is hierarchically structured, starting from low-level curvatures, through mid-level geometric shape and topology descriptions, and finally arriving at high-level semantic surface structure descriptions. We demonstrate our robot vision architecture on a customised dual-arm industrial robot with our self-designed, off-the-shelf stereo vision system, carrying out autonomous grasping and dual-arm flattening. It is worth noting that the proposed dual-arm flattening approach is unique among state-of-the-art autonomous robot systems, which is the major contribution of this paper. The experimental results show that the proposed dual-arm flattening using the stereo vision system remarkably outperforms single-arm flattening and the widely cited Kinect-based sensing system for dexterous manipulation tasks. In addition, the proposed grasping approach achieves satisfactory performance on grasping various kinds of garments, verifying the capability of the proposed visual perception architecture to be adapted to more than one clothes manipulation task.
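As a rough, language-level illustration of this low-to-high-level hierarchy (not the authors' implementation; every function name and threshold below is a hypothetical placeholder), a minimal Python sketch might look like:

```python
import numpy as np

# Minimal sketch of the three-level hierarchy described in the abstract.
# All names and thresholds are hypothetical placeholders.

def low_level_curvatures(depth_map: np.ndarray) -> np.ndarray:
    """Estimate per-pixel surface curvature from a depth map (placeholder)."""
    gy, gx = np.gradient(depth_map)
    gyy, _ = np.gradient(gy)
    _, gxx = np.gradient(gx)
    return gxx + gyy  # crude mean-curvature proxy, for illustration only

def mid_level_topology(curvature: np.ndarray) -> np.ndarray:
    """Label each pixel with a coarse geometric class (ridge / rut / flat)."""
    labels = np.zeros_like(curvature, dtype=int)
    labels[curvature > 0.01] = 1    # ridge-like
    labels[curvature < -0.01] = -1  # rut-like
    return labels

def high_level_wrinkles(labels: np.ndarray) -> list:
    """Group ridge-like pixels into wrinkle candidates (placeholder)."""
    ridge_pixels = np.argwhere(labels == 1)
    return [ridge_pixels] if len(ridge_pixels) else []

depth = np.random.rand(64, 64)  # stand-in for a stereo depth map
wrinkles = high_level_wrinkles(mid_level_topology(low_level_curvatures(depth)))
```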
... In this work, wrinkle descriptions are used to guide a flattening-by-pulling process, in particular the main direction of the largest wrinkles. This work is continued in Sun et al. (2016a); Sun et al. (2018), where a hierarchical visual architecture is described: low-level features consisting of surface curvatures are computed from the B-spline surface fitted to the raw data; from these, Shape Index features are derived as mid-level features (they capture the local topology of the surface, i.e., whether it is a cup, a trough, a saddle, a ridge, a dome, etc., up to nine surface types), which finally allow detecting and quantifying wrinkles as high-level features (fifth-order polynomials fitted to the ridges, while ruts and domes are used for splitting wrinkles). ...
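The Shape Index referred to here is Koenderink's curvature-based measure. A small Python sketch, assuming the two principal curvatures k1 >= k2 have already been obtained from the fitted B-spline surface (note that sign and ordering conventions for the index vary between papers):

```python
import numpy as np

def shape_index(k1: np.ndarray, k2: np.ndarray) -> np.ndarray:
    """Koenderink-style shape index in [-1, 1] from principal curvatures k1 >= k2.
    arctan2 handles the umbilic case k1 == k2; sign conventions differ in the literature."""
    return (2.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)

# Quantise the continuous index into nine surface types
# (cup, trough, rut, saddle rut, saddle, saddle ridge, ridge, dome, cap).
BIN_EDGES = np.linspace(-1.0, 1.0, 10)

def surface_type(k1, k2):
    return np.digitize(shape_index(k1, k2), BIN_EDGES[1:-1])
```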
... See for example Ramisa et al. (2011), which includes an experimental study of different grasping strategies. Also Sun et al. (2018) use their hierarchical vision architecture (recall Section 2.1) to compute, from the shape index and surface topologies, adequate grasping triples (one ridge point and two wrinkle contour points on either side of the wrinkle). For non-convex shapes of the cloth item, neither the centroid nor the center of the convex hull may provide suitable grasping points. ...
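A hedged sketch of how such a grasping triple could be selected from detected wrinkle geometry (an illustrative rule only, not Sun et al.'s exact criterion; all names are placeholders):

```python
import numpy as np

def grasping_triple(ridge_pts: np.ndarray, contour_pts: np.ndarray):
    """Select one ridge point and two wrinkle-contour points on opposite sides.

    ridge_pts, contour_pts: (N, 3) arrays of 3D points (z = height above the table).
    Illustrative rule: take the highest ridge point, then the closest contour
    point on each side of the wrinkle direction. Assumes both sides are non-empty.
    """
    grasp = ridge_pts[np.argmax(ridge_pts[:, 2])]       # highest ridge point
    axis = ridge_pts[-1, :2] - ridge_pts[0, :2]         # wrinkle direction in the table plane
    d = contour_pts[:, :2] - grasp[:2]
    side = axis[0] * d[:, 1] - axis[1] * d[:, 0]        # sign of the 2D cross product
    dist = np.linalg.norm(d, axis=1)
    left = contour_pts[side > 0][np.argmin(dist[side > 0])]
    right = contour_pts[side <= 0][np.argmin(dist[side <= 0])]
    return grasp, left, right
```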
... As for the second type of flattening actions, combinations of longitudinal brushing over wrinkles with pinching and pulling by a single-armed robot have been addressed in Sun et al. (2015, 2016a) and by bimanual manipulation in Lee et al. (2015). More recently, Sun et al. (2018) also resort to a dual-arm flattening setting, using the hierarchical visual architecture mentioned in Section 2.1. In all cases, the main issue is to detect the wrinkles in order to proceed to their removal, but other sensing tasks are needed as well: ...
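As an illustration of how a detected wrinkle's direction can drive a two-handed pull (a simplified rule of thumb, not the planner of any of the cited works; the half-width and pull distance are assumed values):

```python
import numpy as np

def flattening_pulls(wrinkle_pts: np.ndarray, half_width=0.02, pull_dist=0.03):
    """Plan two opposing pulls perpendicular to a detected wrinkle (sketch).

    wrinkle_pts: (N, 2) table-plane points along the wrinkle ridge.
    Returns two (grasp_point, displacement) pairs, one per arm. Half-width and
    pull distance are assumed values; this is an illustrative rule only.
    """
    centre = wrinkle_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(wrinkle_pts - centre, full_matrices=False)
    direction = vt[0]                                  # principal wrinkle direction
    normal = np.array([-direction[1], direction[0]])   # perpendicular, in the plane
    grasp_a = centre + normal * half_width             # one arm on each side of the ridge
    grasp_b = centre - normal * half_width
    return [(grasp_a, normal * pull_dist), (grasp_b, -normal * pull_dist)]
```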
Article
Full-text available
Assistive robots need to be able to perform a large number of tasks that imply some type of cloth manipulation. These tasks include domestic chores such as laundry handling or bed-making, among others, as well as dressing assistance to disabled users. Due to the deformable nature of fabrics, this manipulation requires strong perceptual feedback. Common perceptual skills that enable robots to complete their cloth manipulation tasks are reviewed here, mainly relying on vision, but also resorting to touch and force. The use of such basic skills is then examined in the context of the different cloth manipulation tasks, be they garment-only applications in the line of performing domestic chores, or applications involving physical contact with a human, as in dressing assistance.
... A common method of cloth unfolding is to lay the garment flat on a surface and unfold it, as in a pick-and-place problem [1,2,3,4]. In [3], similar to our method, the authors present an analysis of the types of corners in order to find strategies for unfolding. ...
Preprint
Full-text available
Compared with more rigid objects, clothing items are inherently difficult for robots to recognize and manipulate. We propose a method for detecting how cloth is folded, to facilitate choosing a manipulative action that corresponds to a garment's shape and position. The proposed method involves classifying the edges and corners of a garment by distinguishing between edges formed by folds and the hem or ragged edge of the cloth. Identifying the type of edges in a corner helps to determine how the object is folded. This bottom-up approach, together with an active perception system, allows us to select strategies for robotic manipulation. We corroborate the method using a two-armed robot to manipulate towels of different shapes, textures, and sizes.
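A toy sketch of this bottom-up idea, mapping the types of the two edges meeting at a corner to a fold hypothesis (the labels and the mapping are illustrative assumptions, not the authors' taxonomy):

```python
# Illustrative mapping from the two edge labels meeting at a corner to a fold
# hypothesis. The edge labels ("hem" vs "fold") follow the idea in the text;
# the corner categories themselves are assumptions.
CORNER_TYPE = {
    ("hem", "hem"): "original garment corner (no fold)",
    ("hem", "fold"): "corner produced by a single fold",
    ("fold", "hem"): "corner produced by a single fold",
    ("fold", "fold"): "corner produced by two folds",
}

def classify_corner(edge_a: str, edge_b: str) -> str:
    return CORNER_TYPE.get((edge_a, edge_b), "unknown")
```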
... In [27] the authors proposed an architecture that allows a robot to grasp and manipulate clothes. This architecture uses the stereo cameras of a Kinect and combines 2D features and 3D maps to generate a solid descriptor for the desired objects. ...
Article
Full-text available
Nowadays, robots are indispensable in industry, especially the logistics industry, to replace human employees performing heavy lifting tasks. Introducing robots prevents the musculoskeletal disorders that are common in an ageing workforce. We designed and implemented a dual-arm robot to grasp cardboard boxes of different dimensions using hybrid force/position control. In a first step, the position of the cardboard box was estimated using markers and ARtags, together with an integrated camera. However, this solution showed some limitations, because it is not possible to place an ARtag on every cardboard box in a logistics warehouse. In this paper, we propose a new method to estimate the position of one cardboard box based on vision, without the need for markers at all. Our method exploits the advantages of the integrated RGBD camera through the use of strong features and perspective geometry. It is well suited to the case of one cardboard box due to the simplicity of its geometric shape. The experiments show that our method is fast, robust, and precise, and, of course, is better suited to the logistics warehouse environment than the marker-based estimation procedure for palletization applications.
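A minimal sketch of markerless face-pose recovery from the segmented RGBD points of the visible box face (plane fitting by SVD; this omits the paper's perspective-geometry refinement, and all names are placeholders):

```python
import numpy as np

def box_face_pose(points: np.ndarray):
    """Estimate centre and orientation of a visible cardboard face (sketch).

    points: (N, 3) 3D points segmented on the front face of the box. Plane
    fitting via SVD; the segmentation step itself is omitted.
    """
    centre = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centre, full_matrices=False)
    normal = vt[-1]                                      # direction of least variance
    rotation = np.stack([vt[0], vt[1], normal], axis=1)  # columns: in-plane x, y and the normal
    return centre, rotation
```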
Article
Full-text available
In this paper, a series of studies dealing with flexible materials, covering their manipulation, modelling, and scheduling, is discussed. The main purpose of this work is to provide an overview of the existing technologies and their capabilities, both in manufacturing and in academia, that can be applied to autonomous flexible material handling using robotics. The particularities of flexible material handling require advanced control systems for simulating, monitoring, and managing the deformation of plies. A simulation model for predicting and defining the status of manipulated fabrics is proposed. A digital representation of the production system, on the basis of a Digital Twin, is intended to achieve real-time adaptation. A pioneering control and planning system, interconnected with the digital model, is proposed for orchestrating the manipulation process. Current limitations of the existing technologies in flexible material handling and modelling are outlined and discussed, towards the implementation of a Workcell controller for a flexible material manipulation robotic cell.
Article
Full-text available
Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the observed image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
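The iterative matching loop can be summarised in a few lines; in this sketch, `render`, `predict_delta`, and `compose` are hypothetical stand-ins for the renderer, the DeepIM network, and SE(3) composition rather than the released API:

```python
def refine_pose(observed, initial_pose, render, predict_delta, compose, iters=4):
    """Iterative render-and-compare refinement in the spirit of DeepIM (sketch).

    render(pose)            -> rendered image of the object at `pose`
    predict_delta(obs, ren) -> relative SE(3) correction predicted by the network
    compose(delta, pose)    -> apply the correction to the current pose
    All three are hypothetical callables.
    """
    pose = initial_pose
    for _ in range(iters):
        rendered = render(pose)
        delta = predict_delta(observed, rendered)
        pose = compose(delta, pose)
    return pose
```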
Article
Full-text available
We consider the problem of detecting robotic grasps in an RGB-D view of a scene containing objects. In this work, we apply a deep learning approach to solve this problem, which avoids time-consuming hand-design of features. This presents two main challenges. First, we need to evaluate a huge number of candidate grasps. In order to make detection fast, as well as robust, we present a two-step cascaded structure with two deep networks, where the top detections from the first are re-evaluated by the second. The first network has fewer features, is faster to run, and can effectively prune out unlikely candidate grasps. The second, with more features, is slower but has to run only on the top few detections. Second, we need to handle multimodal inputs well, for which we present a method to apply structured regularization on the weights based on multimodal group regularization. We demonstrate that our method outperforms the previous state-of-the-art methods in robotic grasp detection, and can be used to successfully execute grasps on a Baxter robot.
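The two-step cascade reduces to a prune-then-rescore loop; a hedged sketch with hypothetical `small_net` and `large_net` scoring callables:

```python
def detect_grasp(image, candidates, small_net, large_net, keep_top=100):
    """Two-stage cascade (sketch): a fast, small network prunes the candidate
    grasp rectangles; a larger network re-scores only the survivors.
    `small_net(image, g)` and `large_net(image, g)` are hypothetical callables
    returning a graspability score for candidate g."""
    ranked = sorted(candidates, key=lambda g: small_net(image, g), reverse=True)
    survivors = ranked[:keep_top]                              # cheap pruning stage
    return max(survivors, key=lambda g: large_net(image, g))   # expensive re-scoring
```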
Conference Paper
Full-text available
In this paper, we propose a single-shot approach to recognise clothing categories from 2.5D features. We propose two visual features, BSP (B-Spline Patch) and TSD (Topology Spatial Distances), for this task. The local BSP features are encoded by the LLC (Locality-constrained Linear Coding) technique and fused with three different global features. Our visual feature is robust to the deformable shape of clothing, and the proposed approach is able to recognise the category of unknown clothing in unconstrained and random configurations. We integrated the category recognition pipeline with a stereo vision system, clothing instance detection, and dual-arm manipulators to achieve an autonomous sorting system. To verify the performance of our proposed method, we built a high-resolution RGBD clothing dataset of 50 clothing items of 5 categories sampled in random configurations (a total of 2,100 clothing samples). Experimental results show that our approach is able to reach 83.2% accuracy when classifying unseen clothing items, which advances the state of the art by 36.2%. Finally, we evaluate the proposed approach in an autonomous robot sorting system, in which the robot recognises a clothing item from an unconstrained pile, grasps it, and sorts it into a box according to its category. Our proposed sorting system achieves a reasonable sorting success rate with single-shot perception.
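For reference, the approximated LLC coding step applied to the local BSP descriptors can be sketched as follows (a generic implementation of locality-constrained linear coding, not the authors' code; the neighbourhood size k and the regulariser are assumed settings):

```python
import numpy as np

def llc_encode(x, codebook, k=5, lam=1e-4):
    """Approximated locality-constrained linear coding of one local descriptor.

    x: (D,) descriptor, codebook: (M, D). Generic LLC sketch with assumed settings.
    """
    idx = np.argsort(np.linalg.norm(codebook - x, axis=1))[:k]  # k nearest bases
    z = codebook[idx] - x                                        # shift bases to the descriptor
    C = z @ z.T
    C += lam * np.trace(C) * np.eye(k)                           # regularise
    w = np.linalg.solve(C, np.ones(k))
    code = np.zeros(len(codebook))
    code[idx] = w / w.sum()                                      # codes sum to one
    return code

def image_feature(local_bsp_descs, codebook, global_feats):
    """Max-pool the LLC codes of all local patches, then fuse with global features."""
    pooled = np.max([llc_encode(d, codebook) for d in local_bsp_descs], axis=0)
    return np.concatenate([pooled, global_feats])
```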
Article
Full-text available
To reduce data collection time for deep learning of robust robotic grasp plans, we explore training from a synthetic dataset of 6.7 million point clouds, grasps, and robust analytic grasp metrics generated from thousands of 3D models from Dex-Net 1.0 in randomized poses on a table. We use the resulting dataset, Dex-Net 2.0, to train a Grasp Quality Convolutional Neural Network (GQ-CNN) model that rapidly classifies grasps as robust from depth images and the position, angle, and height of the gripper above a table. Experiments with over 1,000 trials on an ABB YuMi comparing grasp planning methods on singulated objects suggest that a GQ-CNN trained with only synthetic data from Dex-Net 2.0 can be used to plan grasps in 0.8s with a success rate of 93% on eight known objects with adversarial geometry and is 3x faster than registering point clouds to a precomputed dataset of objects and indexing grasps. The GQ-CNN is also the highest performing method on a dataset of ten novel household objects, with zero false positives out of 29 grasps classified as robust and a 1.5x higher success rate than a point cloud registration method.
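At inference time the GQ-CNN is used as a candidate scorer; a hedged sketch of that sample-and-rank loop, where `sample_candidates` and `gqcnn_score` are hypothetical stand-ins for the released components:

```python
import numpy as np

def plan_grasp(depth_image, sample_candidates, gqcnn_score, n_samples=200):
    """Sample-and-rank grasp planning in the spirit of Dex-Net 2.0 (sketch).

    sample_candidates(depth, n) -> list of candidate grasps (centre pixel, angle,
                                   gripper height above the table)
    gqcnn_score(depth, grasp)   -> predicted grasp robustness in [0, 1]
    Both are hypothetical stand-ins.
    """
    candidates = sample_candidates(depth_image, n_samples)
    scores = [gqcnn_score(depth_image, g) for g in candidates]
    return candidates[int(np.argmax(scores))]   # execute the most robust candidate
```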
Article
Full-text available
This work presents a complete pipeline for folding a pile of clothes using a dual-armed robot. This is a challenging task both from the viewpoint of machine vision and robotic manipulation. The presented pipeline comprises the following parts: isolating and picking up a single garment from a pile of crumpled garments, recognizing its category, unfolding the garment using a series of manipulations performed in the air, placing the garment roughly flat on a work table, spreading it, and, finally, folding it in several steps. The pile is segmented into separate garments using color and texture information, and the ideal grasping point is selected based on the features computed from a depth map. The recognition and unfolding of the hanging garment are performed in an active manner, utilizing the framework of active random forests to detect grasp points, while optimizing the robot actions. The spreading procedure is based on the detection of deformations of the garment’s contour. The perception for folding employs fitting of polygonal models to the contour of the observed garment, both spread and already partially folded. We have conducted several experiments on the full pipeline producing very promising results. To our knowledge, this is the first work addressing the complete unfolding and folding pipeline on a variety of garments, including T-shirts, towels, and shorts.
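The pipeline structure maps naturally onto a loop over garments; a skeleton sketch in which every method name is a hypothetical placeholder for the corresponding module described above, not the authors' API:

```python
def fold_pile(robot, vision):
    """Skeleton of the described folding pipeline (all names are placeholders)."""
    while vision.pile_not_empty():
        grasp = vision.pick_point_on_top_garment()          # segment the pile, choose a grasp
        robot.pick_up(grasp)
        category = vision.recognise_hanging_garment()       # active recognition while hanging
        robot.unfold_in_air(category)                        # regrasping sequence in the air
        robot.place_flat_on_table()
        robot.spread(vision.detect_contour_deformations())  # fix remaining deformations
        for step in vision.folding_plan(category):          # polygonal model fitted to contour
            robot.execute_fold(step)
```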
Article
Full-text available
Robotic manipulation of deformable objects remains a challenging task. One such task is to iron a piece of cloth autonomously. Given a roughly flattened cloth, the goal is to produce an ironing plan with which a robot can iteratively apply a regular iron to remove all the major wrinkles. We present a novel solution to analyze the cloth surface by fusing two surface scan techniques: a curvature scan and a discontinuity scan. The curvature scan can estimate the height deviation of the cloth surface, while the discontinuity scan can effectively detect sharp surface features, such as wrinkles. We use this information to detect the regions that need to be pulled and extended before ironing, and the other regions where we want to detect wrinkles and apply ironing to remove them. We demonstrate that our hybrid scan technique is able to capture and classify wrinkles over the surface robustly. Given detected wrinkles, we enable a robot to iron them using shape features. Experimental results show that, using our wrinkle analysis algorithm, our robot is able to iron the cloth surface and effectively remove the wrinkles.
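A hedged sketch of fusing the two scans into a per-pixel decision map (the thresholds and the exact fusion rule are assumptions for illustration, not values from the paper):

```python
import numpy as np

def ironing_labels(curvature_scan, discontinuity_scan, h_thresh=0.005, d_thresh=0.5):
    """Fuse the two scans into a per-pixel action map (sketch).

    curvature_scan: estimated height deviation of the cloth surface (metres).
    discontinuity_scan: response of a sharp-feature detector.
    Thresholds and the fusion rule are illustrative assumptions.
    """
    wrinkle = (curvature_scan > h_thresh) & (discontinuity_scan > d_thresh)
    large_bump = (curvature_scan > 4 * h_thresh) & ~wrinkle
    labels = np.zeros(curvature_scan.shape, dtype=int)  # 0: flat enough, iron normally
    labels[wrinkle] = 1                                  # 1: wrinkle, iron over it
    labels[large_bump] = 2                               # 2: pull/extend before ironing
    return labels
```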
Article
Estimating the 6D pose of known objects is important for robots to interact with objects in the real world. The problem is challenging due to the variety of objects as well as the complexity of the scene caused by clutter and occlusion between objects. In this work, we introduce a new Convolutional Neural Network (CNN) for 6D object pose estimation named PoseCNN. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. PoseCNN is able to handle symmetric objects and is also robust to occlusion between objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN provides very good estimates using only color as input.
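The translation recovery described above follows directly from the pinhole camera model: with intrinsics (fx, fy, px, py), a localised object centre (cx, cy), and a predicted distance Tz, the translation is Tx = (cx − px)·Tz/fx and Ty = (cy − py)·Tz/fy. A small sketch of that back-projection (helper names are ours, not the released code):

```python
import numpy as np

def translation_from_center(cx, cy, tz, fx, fy, px, py):
    """Back-project the localised object centre (cx, cy) and the predicted
    distance tz into a 3D translation using the pinhole camera model;
    fx, fy, px, py are the camera focal lengths and principal point."""
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return np.array([tx, ty, tz])

def rotation_from_quaternion(q):
    """Normalise the regressed quaternion before using it as a rotation."""
    q = np.asarray(q, dtype=float)
    return q / np.linalg.norm(q)
```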
Article
We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
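One servoing step amounts to sampling candidate gripper motions, scoring them with the grasp-success network, and moving toward the best ones; a hedged sketch (the original system refits the sampling distribution with the cross-entropy method over a few rounds; `success_cnn` and `sample_motions` are hypothetical stand-ins):

```python
import numpy as np

def servo_step(image, gripper_pose, success_cnn, sample_motions, n=64, top_k=6):
    """One visual-servoing step (sketch): sample candidate task-space motions,
    score each with the grasp-success network, and command the mean of the best.
    `success_cnn(image, motion)` and `sample_motions(pose, n)` (returning an
    (n, d) array of candidate displacements) are hypothetical stand-ins.
    """
    candidates = sample_motions(gripper_pose, n)
    scores = np.array([success_cnn(image, m) for m in candidates])
    elite = candidates[np.argsort(scores)[-top_k:]]   # best-scoring motions
    return elite.mean(axis=0)                         # commanded displacement
```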