“GrabCut”: Interactive Foreground Extraction using Iterated Graph Cuts

Carsten Rother    Vladimir Kolmogorov    Andrew Blake
Microsoft Research Cambridge, UK

Figure 1: Three examples of GrabCut. The user drags a rectangle loosely around an object. The object is then extracted automatically.
Abstract

The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for “border matting” has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
CR Categories: I.3.3 [Computer Graphics]: Picture/Image
Generation—Display algorithms; I.3.6 [Computer Graphics]:
Methodology and Techniques—Interaction techniques; I.4.6 [Im-
age Processing and Computer Vision]: Segmentation—Pixel clas-
sification; partitioning
Keywords: Interactive Image Segmentation, Graph Cuts, Image
Editing, Foreground extraction, Alpha Matting
1 Introduction
This paper addresses the problem of efficient, interactive extrac-
tion of a foreground object in a complex environment whose back-
ground cannot be trivially subtracted. The resulting foreground ob-
ject is an alpha-matte which reflects the proportion of foreground
and background. The aim is to achieve high performance at the
cost of only modest interactive effort on the part of the user. High
performance in this task includes: accurate segmentation of object
from background; subjectively convincing alpha values, in response
to blur, mixed pixels and transparency; clean foreground colour,
free of colour bleeding from the source background. In general,
degrees of interactive effort range from editing individual pixels, at
the labour-intensive extreme, to merely touching foreground and/or
background in a few locations.

Author e-mails: carrot@microsoft.com, vnk@microsoft.com, ablake@microsoft.com
1.1 Previous approaches to interactive matting
In the following we describe briefly and compare several state of
the art interactive tools for segmentation: Magic Wand, Intelligent
Scissors, Graph Cut and Level Sets and for matting: Bayes Matting
and Knockout. Fig. 2 shows their results on a matting task, together
with the degree of user interaction required to achieve those results.
Magic Wand starts with a user-specified point or region to com-
pute a region of connected pixels such that all the selected pixels
fall within some adjustable tolerance of the colour statistics of the
specified region. While the user interface is straightforward, finding
the correct tolerance level is often cumbersome and sometimes im-
possible. Fig. 2a shows the result using Magic Wand from Adobe
Photoshop 7 [Adobe Systems Incorp. 2002]. Because the distributions
in colour space of foreground and background pixels overlap
considerably, a satisfactory segmentation is not achieved.
Intelligent Scissors (a.k.a. Live Wire or Magnetic Lasso)
[Mortensen and Barrett 1995] allows a user to choose a “minimum
cost contour” by roughly tracing the object’s boundary with the
mouse. As the mouse moves, the minimum cost path from the cur-
sor position back to the last “seed” point is shown. If the computed
path deviates from the desired one, additional user-specified “seed”
points are necessary. In fig. 2b the Magnetic Lasso of Photoshop 7
was used. The main limitation of this tool is apparent: for highly
textured (or untextured) regions many alternative “minimal” paths
exist. Therefore many user interactions (here 19) were necessary to
obtain a satisfactory result. Snakes or Active Contours are a related
approach for automatic refinement of a lasso [Kass et al. 1987].
Bayes matting models colour distributions probabilistically to
achieve full alpha mattes [Chuang et al. 2001] which is based on
[Ruzon and Tomasi 2000]. The user specifies a “trimap” $T = \{T_B, T_U, T_F\}$ in which background and foreground regions $T_B$ and $T_F$ are marked, and alpha values are computed over the remaining region $T_U$. High quality mattes can often be obtained (fig. 2c), but only when the $T_U$ region is not too large and the background/foreground colour distributions are sufficiently well separated. A considerable degree of user interaction is required to construct an internal and an external path.
Knockout 2 [Corel Corporation 2002] is a proprietary plug-in for
Photoshop which is driven from a user-defined trimap, like Bayes
matting, and its results are sometimes similar (fig. 2d), sometimes
of lower quality according to [Chuang et al. 2001].
Graph Cut [Boykov and Jolly 2001; Greig et al. 1989] is a pow-
erful optimisation technique that can be used in a setting similar
to Bayes Matting, including trimaps and probabilistic colour mod-
els, to achieve robust segmentation even in camouflage, when fore-
ground and background colour distributions are not well separated.
The system is explained in detail in section 2. Graph Cut techniques
can also be used for image synthesis, like in [Kwatra et al. 2003]
where a cut corresponds to the optimal smooth seam between two
images, e.g. source and target image.
Level sets [Caselles et al. 1995] is a standard approach to image
and texture segmentation. It is a method for front propagation by
solving a corresponding partial differential equation, and is often
used as an energy minimization tool. Its advantage is that almost
any energy can be used. However, it computes only a local mini-
mum which may depend on initialization. Therefore, in cases where
the energy function can be minimized exactly via graph cuts, the
latter method should be preferable. One such case was identified
by [Boykov and Kolmogorov 2003] for computing geodesics and
minimal surfaces in Riemannian space.
1.2 Proposed system: GrabCut
Ideally, a matting tool should be able to produce continuous alpha
values over the entire inference region $T_U$ of the trimap, without
any hard constraint that alpha values may only be 0 or 1. In that
way, problems involving smoke, hair, trees etc. could be dealt with
appropriately in an automatic way. However, in our experience,
techniques designed for solving that general matting problem [Ru-
zon and Tomasi 2000; Chuang et al. 2001] are effective when there
is sufficient separation of foreground and background color distri-
butions but tend to fail in camouflage. Indeed it may even be that the
general matting problem is not solvable in camouflage, in the sense
that humans would find it hard to perceive the full matte. This mo-
tivates our study of a somewhat less ambitious but more achievable
form of the problem.
First we obtain a “hard” segmentation (sections 2 and 3) using
iterative graph cut. This is followed by border matting (section 4)
in which alpha values are computed in a narrow strip around the
hard segmentation boundary. Finally, full transparency, other than
at the border, is not dealt with by GrabCut. It could be achieved
however using the matting brush of [Chuang et al. 2001] and, in
our experience, this works well in areas that are sufficiently free of
camouflage.
The novelty of our approach lies first in the handling of segmen-
tation. We have made two enhancements to the graph cuts mech-
anism: “iterative estimation” and “incomplete labelling” which to-
gether allow a considerably reduced degree of user interaction for a
given quality of result (fig. 2f). This allows GrabCut to put a light
load on the user, whose interaction consists simply of dragging a
rectangle around the desired object. In doing so, the user is indi-
cating a region of background, and is free of any need to mark a
foreground region. Secondly we have developed a new mechanism
for alpha computation, used for border matting, whereby alpha val-
ues are regularised to reduce visible artefacts.
2 Image segmentation by graph cut
First, the segmentation approach of Boykov and Jolly, the foundation
on which GrabCut is built, is described in some detail.
2.1 Image segmentation
Their paper [Boykov and Jolly 2001] addresses the segmentation of a monochrome image, given an initial trimap $T$. The image is an array $z = (z_1, \ldots, z_n, \ldots, z_N)$ of grey values, indexed by the (single) index $n$. The segmentation of the image is expressed as an array of “opacity” values $\alpha = (\alpha_1, \ldots, \alpha_N)$ at each pixel. Generally $0 \le \alpha_n \le 1$, but for hard segmentation $\alpha_n \in \{0, 1\}$, with 0 for background and 1 for foreground. The parameters $\theta$ describe image foreground and background grey-level distributions, and consist of histograms of grey values:

$$\theta = \{ h(z; \alpha), \; \alpha = 0, 1 \}, \quad (1)$$

one for background and one for foreground. The histograms are assembled directly from labelled pixels from the respective trimap regions $T_B$, $T_F$. (Histograms are normalised to sum to 1 over the grey-level range: $\int_z h(z; \alpha)\,dz = 1$.)

The segmentation task is to infer the unknown opacity variables $\alpha$ from the given image data $z$ and the model $\theta$.
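For concreteness, the histogram model of eq. (1) and the per-pixel data term of eq. (3) might be sketched as follows; this is an illustrative NumPy fragment with our own bin count, trimap encoding (0 for $T_B$, 1 for $T_F$, -1 for $T_U$) and function names, not the authors' code:

```python
import numpy as np

def grey_histograms(z, trimap, bins=32):
    """Model theta of eq. (1): one normalised grey-level histogram per
    label, assembled from the firm trimap regions only.
    z: grey values in [0, 1]; trimap: 0 = T_B, 1 = T_F, -1 = T_U."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    h = {}
    for alpha in (0, 1):  # 0 = background, 1 = foreground
        counts, _ = np.histogram(z[trimap == alpha], bins=edges)
        h[alpha] = counts / max(counts.sum(), 1)  # normalise to sum to 1
    return h, edges

def data_term(zn, alpha, h, edges):
    """Per-pixel contribution to U, eq. (3): -log h(z_n; alpha_n)."""
    b = min(np.searchsorted(edges, zn, side="right") - 1, len(h[alpha]) - 1)
    return -np.log(h[alpha][b] + 1e-9)  # small floor avoids log(0)
```

A pixel whose grey value is common in the foreground histogram thus pays a low cost for the label $\alpha_n = 1$ and a high cost for $\alpha_n = 0$.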
2.2 Segmentation by energy minimisation
An energy function $E$ is defined so that its minimum should correspond to a good segmentation, in the sense that it is guided both by the observed foreground and background grey-level histograms and by the requirement that the opacity be “coherent”, reflecting a tendency to solidity of objects. This is captured by a “Gibbs” energy of the form:

$$E(\alpha, \theta, z) = U(\alpha, \theta, z) + V(\alpha, z). \quad (2)$$

The data term $U$ evaluates the fit of the opacity distribution $\alpha$ to the data $z$, given the histogram model $\theta$, and is defined to be:

$$U(\alpha, \theta, z) = \sum_n -\log h(z_n; \alpha_n). \quad (3)$$

The smoothness term can be written as

$$V(\alpha, z) = \gamma \sum_{(m,n) \in C} \mathrm{dis}(m,n)^{-1} \, [\alpha_n \neq \alpha_m] \, \exp\left(-\beta (z_m - z_n)^2\right), \quad (4)$$

where $[\phi]$ denotes the indicator function taking values 0, 1 for a predicate $\phi$, $C$ is the set of pairs of neighbouring pixels, and $\mathrm{dis}(\cdot)$ is the Euclidean distance between neighbouring pixels. This energy encourages coherence in regions of similar grey-level. In practice, good results are obtained by defining pixels to be neighbours if they are adjacent either horizontally/vertically or diagonally (8-way connectivity). When the constant $\beta = 0$, the smoothness term is simply the well-known Ising prior, encouraging smoothness everywhere, to a degree determined by the constant $\gamma$. It has been shown however [Boykov and Jolly 2001] that it is far more effective to set $\beta > 0$, as this relaxes the tendency to smoothness in regions of high contrast. The constant $\beta$ is chosen [Boykov and Jolly 2001] to be:

$$\beta = \left( 2 \left\langle (z_m - z_n)^2 \right\rangle \right)^{-1}, \quad (5)$$

where $\langle \cdot \rangle$ denotes expectation over an image sample. This choice of $\beta$ ensures that the exponential term in (4) switches appropriately between high and low contrast. The constant $\gamma$ was obtained as 50 by optimizing performance against ground truth over a training set of 15 images. It proved to be a versatile setting for a wide variety of images (see [Blake et al. 2004]).
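As a concrete illustration of eqs. (4) and (5), the NumPy sketch below (our own naming; horizontal/vertical neighbour pairs only, rather than the 8-way connectivity the text recommends) computes $\beta$ from the mean squared neighbour difference and evaluates $V$ for a given labelling:

```python
import numpy as np

def beta_and_smoothness(z, alpha, gamma=50.0):
    """Contrast constant beta of eq. (5) and smoothness energy V of
    eq. (4) for a grey image z (assumed non-constant) and a binary
    labelling alpha, over horizontal/vertical neighbour pairs only."""
    # squared grey-level differences over all H/V neighbour pairs
    dh = (z[:, 1:] - z[:, :-1]) ** 2
    dv = (z[1:, :] - z[:-1, :]) ** 2
    beta = 1.0 / (2.0 * np.concatenate([dh.ravel(), dv.ravel()]).mean())
    # V sums gamma * [alpha_m != alpha_n] * exp(-beta * (z_m - z_n)^2);
    # dis(m, n) = 1 for axis-aligned neighbours
    V = gamma * (
        ((alpha[:, 1:] != alpha[:, :-1]) * np.exp(-beta * dh)).sum()
        + ((alpha[1:, :] != alpha[:-1, :]) * np.exp(-beta * dv)).sum()
    )
    return beta, V
```

A boundary placed across a strong edge pays little (the exponential is small there), while a boundary cut through a flat region pays the full $\gamma$ per severed pair, which is exactly the behaviour the contrast term is designed to produce.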
Now that the energy model is fully defined, the segmentation can be estimated as a global minimum:

$$\hat{\alpha} = \arg\min_{\alpha} E(\alpha, \theta). \quad (6)$$

Minimisation is done using a standard minimum cut algorithm [Boykov and Jolly 2001; Kolmogorov and Zabih 2002]. This algorithm forms the foundation for hard segmentation, and the next section outlines three developments which have led to the new hard segmentation algorithm within GrabCut. First, the monochrome image model is replaced for colour by a Gaussian Mixture Model (GMM) in place of histograms. Secondly, the one-shot minimum cut estimation algorithm is replaced by a more powerful, iterative procedure that alternates between estimation and parameter learning. Thirdly, the demands on the interactive user are relaxed by allowing incomplete labelling: the user specifies only $T_B$ for the trimap, and this can be done simply by placing a rectangle or a lasso around the object.
Figure 2: Comparison of some matting and segmentation tools: (a) Magic Wand, (b) Intelligent Scissors, (c) Bayes Matte, (d) Knockout 2, (e) Graph Cut, (f) GrabCut. The top row shows the user interaction required to complete the segmentation or matting process: white brush/lasso (foreground), red brush/lasso (background), yellow crosses (boundary). The bottom row illustrates the resulting segmentation. GrabCut appears to outperform the other approaches both in terms of the simplicity of user input and the quality of results. Original images on the top row are displayed with reduced intensity to facilitate overlay; see fig. 1 for originals. Note that our implementation of Graph Cut [Boykov and Jolly 2001] uses colour mixture models instead of grey value histograms.
3 The GrabCut segmentation algorithm
This section describes the novel parts of the GrabCut hard segmentation algorithm: iterative estimation and incomplete labelling.
3.1 Colour data modelling
The image is now taken to consist of pixels $z_n$ in RGB colour space. As it is impractical to construct adequate colour space histograms, we follow a practice that is already used for soft segmentation [Ruzon and Tomasi 2000; Chuang et al. 2001] and use GMMs. Each GMM, one for the background and one for the foreground, is taken to be a full-covariance Gaussian mixture with $K$ components (typically $K = 5$). In order to deal with the GMM tractably, in the optimization framework, an additional vector $k = \{k_1, \ldots, k_n, \ldots, k_N\}$ is introduced, with $k_n \in \{1, \ldots, K\}$, assigning to each pixel a unique GMM component, one component either from the background or the foreground model, according as $\alpha_n = 0$ or 1.¹
The Gibbs energy (2) for segmentation now becomes

$$E(\alpha, k, \theta, z) = U(\alpha, k, \theta, z) + V(\alpha, z), \quad (7)$$

depending also on the GMM component variables $k$. The data term $U$ is now defined, taking account of the colour GMM models, as

$$U(\alpha, k, \theta, z) = \sum_n D(\alpha_n, k_n, \theta, z_n), \quad (8)$$

where $D(\alpha_n, k_n, \theta, z_n) = -\log p(z_n \mid \alpha_n, k_n, \theta) - \log \pi(\alpha_n, k_n)$, $p(\cdot)$ is a Gaussian probability distribution, and $\pi(\cdot)$ are mixture weighting coefficients, so that (up to a constant):

$$D(\alpha_n, k_n, \theta, z_n) = -\log \pi(\alpha_n, k_n) + \tfrac{1}{2} \log \det \Sigma(\alpha_n, k_n) + \tfrac{1}{2} \left[ z_n - \mu(\alpha_n, k_n) \right]^\top \Sigma(\alpha_n, k_n)^{-1} \left[ z_n - \mu(\alpha_n, k_n) \right]. \quad (9)$$

Therefore, the parameters of the model are now

$$\theta = \{ \pi(\alpha, k), \, \mu(\alpha, k), \, \Sigma(\alpha, k), \; \alpha = 0, 1, \; k = 1 \ldots K \}, \quad (10)$$

i.e. the weights $\pi$, means $\mu$ and covariances $\Sigma$ of the $2K$ Gaussian components for the background and foreground distributions. The smoothness term $V$ is basically unchanged from the monochrome case (4) except that the contrast term is computed using Euclidean distance in colour space:

$$V(\alpha, z) = \gamma \sum_{(m,n) \in C} [\alpha_n \neq \alpha_m] \exp\left( -\beta \| z_m - z_n \|^2 \right). \quad (11)$$

¹Soft assignments of probabilities for each component to a given pixel might seem preferable, as they would allow “Expectation Maximization” [Dempster et al. 1977] to be used; however that involves significant additional computational expense for what turns out to be negligible practical benefit.
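For illustration, the colour data term of eq. (9), and the per-pixel component choice it induces (step 1 of fig. 3), can be written in NumPy as follows; the array layouts and function names are our own, and in the real system this is evaluated inside the graph-cut framework rather than in isolation:

```python
import numpy as np

def gmm_data_term(zn, pi, mu, sigma):
    """D(alpha_n, k, theta, z_n) of eq. (9) for one RGB pixel against
    every component of one GMM (background or foreground).
    pi: (K,) weights, mu: (K, 3) means, sigma: (K, 3, 3) covariances.
    Returns a length-K vector of costs (constants dropped)."""
    d = zn[None, :] - mu                           # (K, 3) residuals
    inv = np.linalg.inv(sigma)                     # (K, 3, 3)
    maha = np.einsum("ki,kij,kj->k", d, inv, d)    # Mahalanobis distances
    _, logdet = np.linalg.slogdet(sigma)           # stable log det
    return -np.log(pi) + 0.5 * logdet + 0.5 * maha

def assign_component(zn, pi, mu, sigma):
    """Step 1 of fig. 3: k_n := argmin_k D(alpha_n, k, theta, z_n)."""
    return int(np.argmin(gmm_data_term(zn, pi, mu, sigma)))
```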
3.2 Segmentation by iterative energy minimization
The new energy minimization scheme in GrabCut works iteratively, in place of the previous one-shot algorithm [Boykov and Jolly 2001]. This has the advantage of allowing automatic refinement of the opacities $\alpha$, as newly labelled pixels from the $T_U$ region of the initial trimap are used to refine the colour GMM parameters $\theta$. The main elements of the GrabCut system are given in fig. 3. Step 1 is straightforward, done by simple enumeration of the $k_n$ values for each pixel $n$. Step 2 is implemented as a set of Gaussian parameter estimation procedures, as follows. For a given GMM component $k$ in, say, the foreground model, the subset of pixels $F(k) = \{ z_n : k_n = k \text{ and } \alpha_n = 1 \}$ is defined. The mean $\mu(\alpha, k)$ and covariance $\Sigma(\alpha, k)$ are estimated in standard fashion as the sample mean and covariance of pixel values in $F(k)$, and the weights are $\pi(\alpha, k) = |F(k)| / \sum_k |F(k)|$, where $|S|$ denotes the size of a set $S$. Finally, step 3 is a global optimization, using minimum cut, exactly as in [Boykov and Jolly 2001].

The structure of the algorithm guarantees proper convergence properties. This is because each of steps 1 to 3 of iterative minimisation can be shown to be a minimisation of the total energy $E$ with respect to the three sets of variables $k$, $\theta$, $\alpha$ in turn. Hence $E$ decreases monotonically, and this is illustrated in practice in fig. 4. Thus the algorithm is guaranteed to converge at least to a local minimum of $E$. It is straightforward to detect when $E$ ceases to decrease significantly, and to terminate iteration automatically.

Practical benefits of iterative minimisation. Figs. 2e and 2f illustrate how the additional power of iterative minimisation in GrabCut can reduce considerably the amount of user interaction needed to complete a segmentation task, relative to the one-shot graph cut [Boykov and Jolly 2001] approach. This is apparent in two ways. First, the degree of user editing required, after initialisation and optimisation, is reduced. Second, the initial interaction can be simpler, for example by allowing incomplete labelling by the user, as described below.
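The parameter-learning step (step 2) described above might look like this in NumPy for the foreground model; this is an illustrative sketch in which the array layout and the small covariance ridge (which keeps $\Sigma$ invertible when a component captures few pixels) are our own choices:

```python
import numpy as np

def learn_gmm_params(z, alpha, k, K=5):
    """Step 2 of fig. 3 for the foreground model: for each component j,
    collect F(j) = {z_n : k_n = j and alpha_n = 1} and estimate the
    sample mean, covariance and weight pi = |F(j)| / sum_j |F(j)|.
    z: (N, 3) pixel colours; alpha, k: (N,) label and component maps."""
    pi = np.zeros(K)
    mu = np.zeros((K, 3))
    sigma = np.zeros((K, 3, 3))
    fg = alpha == 1
    for j in range(K):
        Fk = z[fg & (k == j)]                 # pixels assigned to component j
        pi[j] = len(Fk)
        if len(Fk) > 0:
            mu[j] = Fk.mean(axis=0)
            # small ridge keeps the covariance invertible
            sigma[j] = np.cov(Fk, rowvar=False, bias=True) + 1e-6 * np.eye(3)
    pi /= max(pi.sum(), 1)                    # weights sum to 1
    return pi, mu, sigma
```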
Initialisation

- The user initialises the trimap $T$ by supplying only $T_B$. The foreground is set to $T_F = \emptyset$, and $T_U = \overline{T}_B$, the complement of the background.
- Initialise $\alpha_n = 0$ for $n \in T_B$ and $\alpha_n = 1$ for $n \in T_U$.
- Background and foreground GMMs are initialised from the sets $\{n : \alpha_n = 0\}$ and $\{n : \alpha_n = 1\}$ respectively.

Iterative minimisation

1. Assign GMM components to pixels: for each $n$ in $T_U$, $k_n := \arg\min_{k_n} D_n(\alpha_n, k_n, \theta, z_n)$.
2. Learn GMM parameters from data $z$: $\theta := \arg\min_{\theta} U(\alpha, k, \theta, z)$.
3. Estimate segmentation: use min cut to solve $\min_{\{\alpha_n : n \in T_U\}} \min_{k} E(\alpha, k, \theta, z)$.
4. Repeat from step 1, until convergence.
5. Apply border matting (section 4).

User editing

- Edit: fix some pixels either to $\alpha_n = 0$ (background brush) or $\alpha_n = 1$ (foreground brush); update the trimap $T$ accordingly. Perform step 3 above, just once.
- Refine operation: [optional] perform the entire iterative minimisation algorithm.

Figure 3: Iterative image segmentation in GrabCut
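To make the alternation of fig. 3 concrete, here is a deliberately simplified sketch (not the authors' code): grey values are scalar rather than RGB, each label's mixture is re-learnt by a crude sorted split rather than per-component sample statistics, and the min-cut of step 3 is replaced by a per-pixel data-term decision, so the smoothness term $V$ is ignored. All names are our own. Even so, the sketch exhibits the key behaviour: background-coloured pixels inside the box are provisionally foreground at initialisation and are retracted as the models are refined.

```python
import numpy as np

def iterate_grabcut_1d(z, in_box, K=2, iters=3):
    """Toy version of the fig. 3 loop on scalar "grey" values.
    z: (N,) values; in_box: (N,) bool, True inside the user rectangle.
    Outside the box is firm background T_B and is never relabelled."""
    alpha = in_box.astype(int)            # provisional: box interior = fg
    for _ in range(iters):
        theta = []
        for a in (0, 1):                  # step 2: learn per-label mixture
            vals = z[alpha == a]
            comps = np.array_split(np.sort(vals), K)   # crude K-way split
            theta.append([(len(c) / len(vals), c.mean(),
                           max(c.var(), 1e-4)) for c in comps])
        def D(x, a):                      # step 1 folded in: best component
            return min(-np.log(w) + 0.5 * np.log(2 * np.pi * v)
                       + 0.5 * (x - m) ** 2 / v
                       for w, m, v in theta[a] if w > 0)
        # step 3 stand-in: data-term-only relabel, inside the box only
        inside = np.array([int(D(x, 1) < D(x, 0)) for x in z])
        alpha = np.where(in_box, inside, 0)
    return alpha
```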
Figure 4: Convergence of iterative minimization for the data of fig. 2f. (a) The energy $E$ for the llama example converges over 12 iterations. The GMM in RGB colour space (side-view showing R, G) at initialization (b) and after convergence (c). $K = 5$ mixture components were used for both background (red) and foreground (blue). Initially (b) both GMMs overlap considerably, but are better separated after convergence (c), as the foreground/background labelling has become accurate.
3.3 User Interaction and incomplete trimaps
Incomplete trimaps. The iterative minimisation algorithm allows increased versatility of user interaction. In particular, incomplete labelling becomes feasible: in place of the full trimap $T$, the user needs only specify, say, the background region $T_B$, leaving $T_F = \emptyset$. No hard foreground labelling is done at all. Iterative minimisation (fig. 3) deals with this incompleteness by allowing provisional labels on some pixels (in the foreground) which can subsequently be retracted; only the background labels $T_B$ are taken to be firm, guaranteed not to be retracted later. (Of course a complementary scheme, with firm labels for the foreground only, is also a possibility.) In our implementation, the initial $T_B$ is determined by the user as a strip of pixels around the outside of the marked rectangle (marked in red in fig. 2f).
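In code, the rectangle-only initialisation might be sketched as follows (the trimap encoding and names are our own, chosen for illustration):

```python
import numpy as np

def init_trimap(shape, rect):
    """Initial incomplete trimap from a user rectangle (x0, y0, x1, y1):
    everything outside is firm background T_B; everything inside is the
    unknown region T_U, provisionally labelled foreground; T_F starts
    empty. Encoding (our choice): 0 = T_B, -1 = T_U."""
    x0, y0, x1, y1 = rect
    trimap = np.zeros(shape, dtype=int)   # all firm background
    trimap[y0:y1, x0:x1] = -1             # inside: unknown / provisional fg
    alpha = (trimap == -1).astype(int)    # alpha_n = 1 on T_U
    return trimap, alpha
```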
Figure 5: User editing. After the initial user interaction and seg-
mentation (top row), further user edits (fig. 3) are necessary. Mark-
ing roughly with a foreground brush (white) and a background
brush (red) is sufficient to obtain the desired result (bottom row).
Further user editing. The initial, incomplete user-labelling is of-
ten sufficient to allow the entire segmentation to be completed au-
tomatically, but by no means always. If not, further user editing
is needed [Boykov and Jolly 2001], as shown in fig.5. It takes the
form of brushing pixels, constraining them either to be firm fore-
ground or firm background; then the minimisation step 3. in fig. 3
is applied. Note that it is sufficient to brush, roughly, just part of a
wrongly labeled area. In addition, the optional “refine” operation of
fig. 3 updates the colour models, following user edits. This prop-
agates the effect of edit operations which is frequently beneficial.
Note that for efficiency the optimal flow, computed by Graph Cut,
can be re-used during user edits.
4 Transparency
Given that a matting tool should be able to produce continuous al-
pha values, we now describe a mechanism by which hard segmenta-
tion, as described above, can be augmented by “border matting”, in
which full transparency is allowed in a narrow strip around the hard
segmentation boundary. This is sufficient to deal with the problem
of matting in the presence of blur and mixed pixels along smooth
object boundaries. The technical issues are: Estimating an alpha-
map for the strip without generating artefacts, and recovering the
foreground colour, free of colour bleeding from the background.
4.1 Border Matting
Border matting begins with a closed contour $C$, obtained by fitting a polyline to the segmentation boundary from the iterative hard segmentation of the previous section. A new trimap $\{T_B, T_U, T_F\}$ is computed, in which $T_U$ is the set of pixels in a ribbon of width $\pm w$ pixels either side of $C$ (we use $w = 6$). The goal is to compute the map $\alpha_n$, $n \in T_U$, and in order to do this robustly, a strong model is assumed for the shape of the $\alpha$-profile within $T_U$. The form of the model is based on [Mortensen and Barrett 1999] but with two important additions: regularisation to enhance the quality of the estimated $\alpha$-map; and a dynamic programming (DP) algorithm for estimating $\alpha$ throughout $T_U$.

Let $t = 1, \ldots, T$ be a parameterization of contour $C$, periodic with period $T$, as curve $C$ is closed. An index $t(n)$ is assigned to each pixel $n \in T_U$, as in fig. 6(b). The $\alpha$-profile is taken to be a soft step-function $g$ (fig. 6c): $\alpha_n = g(r_n; \Delta_{t(n)}, \sigma_{t(n)})$, where $r_n$ is a signed distance from pixel $n$ to contour $C$. Parameters $\Delta$, $\sigma$ determine the centre and width respectively of the transition from 0 to 1 in the $\alpha$-profile. It is assumed that all pixels with the same index $t$ share values of the parameters $\Delta_t$, $\sigma_t$.

Figure 6: Border matting. (a) Original image with trimap overlaid. (b) Notation for contour parameterisation and distance map. Contour $C$ (yellow) is obtained from hard segmentation. Each pixel in $T_U$ is assigned an (integer) value of contour parameter $t$ and a distance $r_n$ from $C$; pixels shown share the same value of $t$. (c) Soft step-function for the $\alpha$-profile $g$, with centre $\Delta$ and width $\sigma$.
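The text does not spell out the algebraic form of the soft step-function $g$; a logistic sigmoid in the signed distance $r$ is one natural choice, assumed here purely for illustration:

```python
import numpy as np

def alpha_profile(r, delta, sigma):
    """Soft step-function g for the alpha-profile (fig. 6c): a logistic
    sigmoid in the signed distance r from the contour, with centre
    delta and width sigma. The exact form of g is our assumption."""
    return 1.0 / (1.0 + np.exp(-(r - delta) / max(sigma, 1e-6)))
```

Any monotone transition from 0 at one edge of the ribbon to 1 at the other (or the opposite sign convention for $r$) would serve the same role.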
Parameter values $\Delta_1, \sigma_1, \ldots, \Delta_T, \sigma_T$ are estimated by minimizing the following energy function using DP over $t$:

$$E = \sum_{n \in T_U} \tilde{D}_n(\alpha_n) + \sum_{t=1}^{T} \tilde{V}(\Delta_t, \sigma_t, \Delta_{t+1}, \sigma_{t+1}), \quad (12)$$

where $\tilde{V}$ is a smoothing regularizer:

$$\tilde{V}(\Delta, \sigma, \Delta', \sigma') = \lambda_1 (\Delta - \Delta')^2 + \lambda_2 (\sigma - \sigma')^2, \quad (13)$$

whose role is to encourage $\alpha$-values to vary smoothly as $t$ increases, along the curve $C$ (and we take $\lambda_1 = 50$ and $\lambda_2 = 1000$). For the DP computation, values of $\Delta_t$ are discretised into 30 levels and $\sigma_t$ into 10 levels. A general smoothness term $\tilde{V}$ would require time quadratic in the number of profiles to move from $t$ to $t+1$; however our regularizer allows a linear time algorithm using distance transforms [Rucklidge 1996]. Since the contour $C$ is closed, minimization cannot be done exactly using single-pass DP, and we approximate by using two complete passes of DP, assuming that the first pass gives the optimal profile for $t = T/2$.
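The DP over profiles can be sketched as a plain Viterbi pass over the discretised $(\Delta, \sigma)$ levels. This illustration (our own naming) deliberately omits the distance-transform speed-up and the two-pass closed-contour approximation described above, and runs in $O(T S^2)$ for $S$ states:

```python
import numpy as np

def dp_profiles(data_cost, delta_vals, sigma_vals, lam1=50.0, lam2=1000.0):
    """Open-contour DP for eq. (12). data_cost[t, i, j] is the summed
    data term over pixels with index t when (Delta_t, sigma_t) =
    (delta_vals[i], sigma_vals[j]); transitions pay the regularizer of
    eq. (13). Returns the optimal total cost and profile sequence."""
    T, nd, ns = data_cost.shape
    # enumerate joint states s = (Delta level i, sigma level j)
    D, S = np.meshgrid(delta_vals, sigma_vals, indexing="ij")
    d, s = D.ravel(), S.ravel()
    trans = lam1 * (d[:, None] - d[None, :]) ** 2 \
          + lam2 * (s[:, None] - s[None, :]) ** 2        # eq. (13)
    unary = data_cost.reshape(T, -1)
    cost = unary[0].copy()
    back = np.zeros((T, nd * ns), dtype=int)
    for t in range(1, T):
        step = cost[:, None] + trans                     # from -> to
        back[t] = np.argmin(step, axis=0)                # best predecessor
        cost = step[back[t], np.arange(nd * ns)] + unary[t]
    # backtrack the optimal profile sequence
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return float(np.min(cost)), [(d[p], s[p]) for p in path]
```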
The data term is defined as

$$\tilde{D}_n(\alpha_n) = -\log N\!\left( z_n; \, \mu_{t(n)}(\alpha_n), \, \Sigma_{t(n)}(\alpha_n) \right), \quad (14)$$

where $N(z; \mu, \Sigma)$ denotes a Gaussian probability density for $z$ with mean $\mu$ and covariance $\Sigma$. Mean and covariance for (14) are defined for matting as in [Ruzon and Tomasi 2000]:

$$\mu_t(\alpha) = (1 - \alpha)\,\mu_t(0) + \alpha\,\mu_t(1), \quad (15)$$
$$\Sigma_t(\alpha) = (1 - \alpha)^2\,\Sigma_t(0) + \alpha^2\,\Sigma_t(1).$$

The Gaussian parameters $\mu_t(\alpha), \Sigma_t(\alpha)$, $\alpha = 0, 1$, for foreground and background are estimated as the sample mean and covariance from each of the regions $F_t$ and $B_t$, defined as $F_t = S_t \cap T_F$ and $B_t = S_t \cap T_B$, where $S_t$ is a square region of size $L \times L$ pixels centred on the segmentation boundary $C$ at $t$ (and we take $L = 41$).
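Eqs. (14) and (15) translate directly into code; the sketch below (our own naming) evaluates the blended Gaussian data term for a candidate alpha value:

```python
import numpy as np

def matting_gaussian(alpha, mu0, mu1, cov0, cov1):
    """Blended mean and covariance of eq. (15):
    mu_t(a) = (1-a) mu_t(0) + a mu_t(1),
    Sigma_t(a) = (1-a)^2 Sigma_t(0) + a^2 Sigma_t(1)."""
    mu = (1 - alpha) * mu0 + alpha * mu1
    cov = (1 - alpha) ** 2 * cov0 + alpha ** 2 * cov1
    return mu, cov

def matting_data_term(z, alpha, mu0, mu1, cov0, cov1):
    """D~_n(alpha_n) of eq. (14): negative log Gaussian density of the
    observed colour z under the alpha-blended model."""
    mu, cov = matting_gaussian(alpha, mu0, mu1, cov0, cov1)
    d = z - mu
    _, logdet = np.linalg.slogdet(cov)
    k = len(z)
    return 0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.inv(cov) @ d)
```

A mixed pixel midway between the local foreground and background colours is thus best explained by an intermediate alpha, which is exactly what the DP exploits when fitting the profile.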
4.2 Foreground estimation
The aim here is to estimate foreground pixel colours without colours bleeding in from the background of the source image. Such bleeding can occur with Bayes matting because the probabilistic algorithm used aims to strip the background component from mixed pixels but cannot do so precisely. The residue of the stripping process can show up as colour bleeding. Here we avoid this by stealing pixels from the foreground $T_F$ itself. First the Bayes matte algorithm [Chuang et al. 2001, eq. (9)] is applied to obtain an estimate of foreground colour $\hat{f}_n$ for a pixel $n \in T_U$. Then, from the neighbourhood $F_{t(n)}$ as defined above, the pixel colour that is most similar to $\hat{f}_n$ is stolen to form the foreground colour $f_n$. Finally, the combined results of border matting, using both regularised alpha computation and foreground pixel stealing, are illustrated in fig. 7.
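Pixel stealing reduces to a nearest-neighbour search in colour space over the candidate set $F_{t(n)}$; a minimal sketch (names are our own):

```python
import numpy as np

def steal_foreground_colour(f_hat, candidates):
    """Foreground pixel stealing: replace the Bayes-matte estimate
    f_hat with the most similar actual pixel colour from the nearby
    firm-foreground region F_t(n). The output is guaranteed to be a
    real foreground colour, so no background colour can bleed in."""
    candidates = np.asarray(candidates)
    dists = np.linalg.norm(candidates - f_hat, axis=1)
    return candidates[int(np.argmin(dists))]
```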
Figure 7: Comparing methods for border matting: Knockout 2, Bayes Matte, and GrabCut. Given the border trimap shown in fig. 6, GrabCut obtains a cleaner matte than either Knockout 2 or Bayes matte, due to the regularised $\alpha$-profile.
5 Results and Conclusions
Various results are shown in figs. 1, 8. The images in fig. 1 each
show cases in which the bounding rectangle alone is a sufficient
user interaction to enable foreground extraction to be completed
automatically by GrabCut. Fig. 8 shows examples of increasing
difficulty. Failures, in the sense of requiring a significant amount
of user interaction, can occur in three cases: (i) regions of low
contrast at the transition from foreground to background (e.g. black
part of the butterfly’s wing fig. 8); (ii) camouflage, in which the true
foreground and background distributions overlap partially in colour
space (e.g. soldier’s helmet fig. 5); (iii) background material inside
the user rectangle happens not to be adequately represented in the
background region (e.g. swimmer fig. 8). User interactions for the
third case can be reduced by replacing the rectangle with a lasso2.
The fox image (fig. 8) demonstrates that our border matting method
can cope with moderately difficult alpha mattes. For difficult alpha
mattes, like the dog image (fig. 8), the matting brush is needed.
The operating time is acceptable for an interactive user interface,
e.g. a target rectangle of size 450x300 pixels (butterfly fig. 8) re-
quires 0.9sec for initial segmentation and 0.12sec after each brush
stroke on a 2.5 GHz CPU with 512 MB RAM.
Comparison of Graph Cut [Boykov and Jolly 2001] and GrabCut.
In a first experiment the amount of user interaction for 20 segmentation
tasks was evaluated. In 15 moderately difficult examples
(e.g. flowers, fig. 1) GrabCut needed significantly fewer interactions.
Otherwise both methods performed comparably².
In a second experiment we compare GrabCut using a single “outside”
lasso (e.g. red line in fig. 2c (top)) with Graph Cut using two
lassos, outer and inner (e.g. red and white lines in fig. 2c (top)).
Even with missing foreground training data, GrabCut performs at
almost the same level of accuracy as Graph Cut. Based on a ground
truth database of 50 images², the error rates are 1.36 ± 0.18% for
Graph Cut compared with 2.13 ± 0.19% for GrabCut.
In conclusion, a new algorithm for foreground extraction has
been proposed and demonstrated, which obtains foreground alpha
mattes of good quality for moderately difficult images with a rather
modest degree of user effort. The system combines hard segmenta-
tion by iterative graph-cut optimisation with border matting to deal
with blur and mixed pixels on object boundaries.
Image credits. Images in fig. 5, 8 and flower image in fig.
1 are from Corel Professional Photos, copyright 2003 Microsoft
Research and its licensors, all rights reserved. The llama image
(fig. 1, 2) is courtesy of Matthew Brown.
²See www.research.microsoft.com/vision/cambridge/segmentation/
Figure 8: Results using GrabCut. The first row shows the original images with superimposed user input (red rectangle); the leftmost example required no user interaction beyond the rectangle. The second row displays all user interactions: red (background brush), white (foreground brush) and yellow (matting brush). The degree of user interaction increases from left to right. The results obtained by GrabCut are visualized in the third row. The last row shows zoomed portions of the respective results, documenting that the recovered alpha mattes are smooth and free of background bleeding.
Acknowledgements. We gratefully acknowledge discussions
with and assistance from P. Anandan, C.M. Bishop, M. Gangnet,
P. Perez, P. Torr and M. Brown.
References
ADOBE SYSTEMS INCORP. 2002. Adobe Photoshop User Guide.
BLAKE, A., ROTHER, C., BROWN, M., PEREZ, P., AND TORR,
P. 2004. Interactive Image Segmentation using an adaptive
GMMRF model. In Proc. European Conf. Computer Vision.
BOYKOV, Y., AND JOLLY, M.-P. 2001. Interactive graph cuts
for optimal boundary and region segmentation of objects in N-D
images. In Proc. IEEE Int. Conf. on Computer Vision, CD–ROM.
BOYKOV, Y., AND KOLMOGOROV, V. 2003. Computing
Geodesics and Minimal Surfaces via Graph Cut. In Proc. IEEE
Int. Conf. on Computer Vision.
CASELLES, V., KIMMEL, R., AND SAPIRO, G. 1995. Geodesic
active contours. In Proc. IEEE Int. Conf. on Computer Vision.
CHUANG, Y.-Y., CURLESS, B., SALESIN, D., AND SZELISKI, R.
2001. A Bayesian approach to digital matting. In Proc. IEEE
Conf. Computer Vision and Pattern Recog., CD–ROM.
COREL CORPORATION. 2002. Knockout user guide.
DEMPSTER, A., LAIRD, N., AND RUBIN, D. 1977. Maximum
likelihood from incomplete data via the EM algorithm. J. Roy.
Stat. Soc. B 39, 1–38.
GREIG, D., PORTEOUS, B., AND SEHEULT, A. 1989. Exact MAP
estimation for binary images. J. Roy. Stat. Soc. B. 51, 271–279.
KASS, M., WITKIN, A., AND TERZOPOULOS, D. 1987. Snakes:
Active contour models. In Proc. IEEE Int. Conf. on Computer
Vision, 259–268.
KOLMOGOROV, V., AND ZABIH, R. 2002. What energy functions
can be minimized via graph cuts? In Proc. ECCV. CD–ROM.
KWATRA, V., SCHÖDL, A., ESSA, I., TURK, G., AND BOBICK,
A. 2003. Graphcut Textures: Image and Video Synthesis Using
Graph Cuts. Proc. ACM Siggraph, 277–286.
MORTENSEN, E., AND BARRETT, W. 1995. Intelligent scissors
for image composition. Proc. ACM Siggraph, 191–198.
MORTENSEN, E., AND BARRETT, W. 1999. Toboggan-based intelligent
scissors with a four parameter edge model. In Proc. IEEE
Conf. Computer Vision and Pattern Recog., vol. 2, 452–458.
RUCKLIDGE, W. J. 1996. Efficient visual recognition using the
Hausdorff distance. LNCS. Springer-Verlag, NY.
RUZON, M., AND TOMASI, C. 2000. Alpha estimation in natural
images. In Proc. IEEE Conf. Comp. Vision and Pattern Recog.