Semiconductor Defect Classification using Hyperellipsoid Clustering
Neural Networks and Model Switching
Keisuke Kameyama
Interdisciplinary Graduate School of Sci. and Eng.
Tokyo Institute of Technology
Yokohama 226-8502, Japan
Yukio Kosugi
Frontier Collaborative Research Center
Tokyo Institute of Technology
Yokohama 226-8503, Japan
Abstract
An automatic defect classification (ADC) system for visual inspection of semiconductor wafers, using a neural network classifier, is introduced. The proposed Hyperellipsoid Clustering Network (HCN), employing Radial Basis Function (RBF) type units in the hidden layer, is trained with additional penalty conditions for recognizing unfamiliar inputs as originating from an unknown defect class. Also, by using a dynamic model alteration method called Model Switching, a reduced-model classifier which enables efficient classification is obtained. In the experiments, the effectiveness of the unfamiliar input recognition was confirmed, and a classification rate sufficiently high for use in the semiconductor fab was obtained.
1. Introduction
Visual inspection plays an important role in the manufacturing processes of semiconductors. The disorders found on the wafer surface, such as the one shown in Fig. 1, are commonly referred to as defects. The motive for defect classification is to find out the process stages and the sources that are causing them. Early detection of the sources of defects is essential in order to maintain high product yield and quality.
By replacing the review process typically conducted by human experts, the system also aims to improve both the stability and the speed of inspection. In the literature, it is reported that the classification accuracies of human experts are typically 60-80% [1]. If this stage of visual inspection could be automated, it would greatly contribute to enhancing the productivity of the semiconductor fab.
Figure 1. A defect found on a semiconductor wafer (the scale bar indicates 5 µm).
The task of classifying the defect image features has several
specific conditions inherent to the particular problem. Most distinctive among them is the fact that the user does not have the freedom of collecting a sufficient number of, or an appropriate selection of, training images. Also, the numbers of training samples per class are extremely unbalanced.
When the number of samples for a defect class is small, approaches whose decisions rely on all samples, such as radial basis function (RBF) networks [10][12] or the joint use of nonparametric estimation of the probability density function by Parzen's method [11] and Bayes classification, perform well. However, for a class with many samples, these methods are computationally costly. In this case, instead of using all the training samples for classification, methods based on distances from class-cluster prototypes, such as the nearest neighbor algorithm [2] and learning vector quantization [9], and those based on class borders, such as multilayer perceptrons (MLP) [14] and support vector machines [15], are computationally more efficient. So-called reduced variants of the above nonparametric methods, such as generalized RBF networks [12] and reduced Parzen classifiers [3], also depend on the distances from the prototypes.
In this work, a three-layered neural network named the Hyperellipsoid Clustering Network (HCN), having hidden layer units of RBF type, will be used. In addition to parameter adjustment by the backpropagation (BP) method [14], a model alteration method called Model Switching (MS) [7], which allows the map acquired by training to be inherited by the new model, is used during the training process for efficiently obtaining an appropriate reduced model.
The second requirement of the system is to classify the known defect classes without fail and not to make wild guesses on unfamiliar defects. Such cases should be pointed out as unclassifiable and left for the human expert to examine. Since the training set will usually provide answers in only a small portion of the feature space, inputs in the remaining open space should be treated as unknown. For recognizing unfamiliar inputs, the HCN is trained with an additional penalty condition, so that the sizes of the hyperellipsoid kernels are kept small, to tightly enclose the clusters formed by the training samples.
In Sec. 2, the HCN will be introduced, together with its training method and the output interpretation method for recognition of unfamiliar inputs. In Sec. 3, the idea of Model Switching for allowing dynamic model alteration during training will be reviewed. The defect classes and the outline of the automatic defect classification (ADC) system will be explained in Sec. 4. In Sec. 5, the network and the ADC system will be evaluated by applying them to the classification of the defect image sets, and the paper will be concluded in Sec. 6.
2. Hyperellipsoid clustering network (HCN)
The three-layered network model used for classifying the feature vectors is illustrated in Fig. 2. The network has L inputs, N hidden units and O output units. The potential of the n-th hidden layer unit is defined as

u_n(x) = r_n^2 − ‖H_n (x − m_n)‖^2    (1)

with the following parameters to be adjusted in the training:

r_n ∈ ℝ : radius parameter,
m_n = (m_n1, ..., m_nL)^T ∈ ℝ^L : center vector,
H_n ∈ ℝ^{L×L} : weight matrix.

The transfer function of the hidden layer unit is the well-known sigmoid function. Thus, the output of unit n is

h_n(x) = 1 / (1 + exp(−u_n(x))).    (2)
Figure 2. The Hyperellipsoid Clustering Network (HCN): the input vector x enters the input layer (units l = 1, ..., L); the hidden layer (units n = 1, ..., N) applies a hyperellipsoid discriminant followed by a sigmoid, with parameters (H_n, m_n, r_n); the linear output layer (units k = 1, ..., O) forms weighted sums with connection weights w_k to give the output vector y.
Figure 3. An example of the kernel functions made by the joint use of (hyper)ellipsoid discriminants and sigmoid functions.
A unit in the output layer takes the fan-out of the hidden layer units and calculates a weighted sum with no bias as

y_k = w_k^T h    (3)

where w_k = (w_k1, ..., w_kN)^T ∈ ℝ^N is the weight vector of the k-th output unit, and h = (h_1, ..., h_N)^T ∈ ℝ^N. The weight vectors are also modified in the training process.
By employing the discriminant in Eq. (1), the discrimination surface in the feature space will always be a hyperellipsoid. Since the unit potential in Eq. (1) depends on the distance between the input x and the center vector m_n, the network is an RBF network. However, in contrast with the popular Gaussian RBF network [12], various profiles of the kernel function are possible by controlling the gain [4] of the sigmoid function with the radius parameter r_n, as shown in Fig. 3. This network model, using the hyperellipsoid discriminant and the sigmoid function in the hidden layer, will be referred to as the Hyperellipsoid Clustering Network (HCN).
The training method used in the HCN is based on the batched BP law with momentum terms [14]. The error criterion E is defined as

E = (1/P) Σ_{p=1}^{P} E_p = (1/P) Σ_{p=1}^{P} ‖t_p − y_p‖^2    (4)

with P, E_p, t_p ∈ ℝ^O and y_p ∈ ℝ^O denoting the cardinality of the training set, the error for the p-th training pair, the p-th training output vector and the p-th output vector, respectively.
For enabling a “tight bounding by hyperellipsoids” to implement the recognition of the unfamiliar inputs, the volume of the hyperellipsoids should be kept small as long as this does not harm the achievement of training. This can be done by setting penalty terms that restrict the radii of the hyperellipsoids. The distance from the center to the edge of the hyperellipsoid in the direction of the i-th principal component can be written as r_n / √λ_i, where λ_i is the i-th eigenvalue of the matrix H_n^T H_n, which is always positive. Thus, a penalty to suppress the absolute value of the radius parameter r_n can be considered effective. Also, a term to prevent the eigenvalues from becoming too small was necessary. This second restriction was implemented indirectly, by preventing the Euclidean norm of the matrix H_n from becoming too small. Consequently, the modification measures for the weight matrix H_n and the radius parameter r_n were formulated as

ΔH_n = Δ_BP H_n + μ_H H_n    (5)

and

Δr_n = Δ_BP r_n − μ_r r_n    (6)

with the terms Δ_BP H_n and Δ_BP r_n denoting the modification measures by the plain BP training. Parameters μ_H and μ_r denote the penalty term gains.
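Read as parameter updates, Eqs. (5)-(6) add a norm-sustaining term for H_n and a decay term for r_n on top of the plain BP steps. A minimal sketch under that reading (the exact penalty forms follow our reconstruction above; all arrays are assumed to be NumPy arrays):

```python
def penalized_step(H_n, r_n, dH_bp, dr_bp, mu_H, mu_r):
    """One parameter update for hidden unit n, with the penalty terms.

    dH_bp, dr_bp : modification measures from plain BP training
    mu_H, mu_r   : penalty term gains (assumed small positive constants)
    """
    H_n = H_n + dH_bp + mu_H * H_n   # Eq. (5): keep ||H_n|| from shrinking
    r_n = r_n + dr_bp - mu_r * r_n   # Eq. (6): suppress |r_n| (tight bounding)
    return H_n, r_n
```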
The network will be trained to respond with a class-specific unit vector. Since the output is the weighted sum of the kernel functions of the hidden layer units, it is justified to reject an output vector that does not have a significant winner. In such a case, the input pattern should be classified as originating from an unknown class. Therefore, the output interpretation

class(x) = argmax_k y_k  if max_k y_k ≥ θ;  unknown  otherwise    (7)

will be used, with θ being the membership threshold.
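Eq. (7) amounts to winner-take-all with rejection; a minimal sketch:

```python
import numpy as np

def interpret_output(y, theta):
    """Eq. (7): return the winning class index, or 'unknown' if no output
    exceeds the membership threshold theta."""
    k = int(np.argmax(y))
    return k if y[k] >= theta else "unknown"
```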
3. Model switching
As a method for obtaining a reduced network model in
the learning process, model alteration scheme called Model
Switching (MS) [7] is employed. MS is a framework for dy-
namic model alteration during the BP training for improve-
ment of the training efficiency, by avoiding the local minima
and reducing the redundancy in the network model.
Definition 1 (Model Switching) On altering the neural network model, methods which determine the moment or the occasion of model alteration by taking into account both of the following two factors:
1. The nature and fitness of the new model and of the initial map candidate within the new model.
2. The status of the immediate model and map.
will be referred to as Model Switching (MS).
In this work, MS will be used to reduce the number of hidden layer units in the HCN, in which training is initially started with a model having the same number of hidden units as training samples. Pruning algorithms [13], which are also attempts to reduce network size, mostly limit the occasion of model reduction to after the convergence of the training error. With MS, however, the occasion can be set at any time, as long as the fitness of the candidate for the initial map within the new model is met. When only model reduction is used in MS, only the first factor in Def. 1 needs to be considered.
The process of training by BP with MS is shown in Fig. 4. At each training epoch of BP, the fitness of the switchable candidates is evaluated, and switching takes place when the fitness I_F of a candidate exceeds a given threshold I_F0.
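The training loop of Fig. 4 has the following overall structure; the network interface used here (bp_epoch, error, fitness, fusion_candidates, fuse) is hypothetical, standing in for whatever the actual implementation provided:

```python
def train_with_model_switching(net, data, E0, IF0, max_epochs=10000):
    """BP training with Model Switching, following the structure of Fig. 4.

    `net` is a hypothetical HCN object exposing:
      bp_epoch(data)        -- one batched-BP parameter update
      error(data)           -- current training error E
      fusion_candidates()   -- switchable model-map candidate set C_MS
      fitness(c)            -- fitness index I_F(f_N, f_N_i) of candidate c
      fuse(c)               -- switch to the fused (reduced) model
    """
    for _ in range(max_epochs):
        net.bp_epoch(data)                  # modify parameters of f_N by BP
        if net.error(data) < E0:            # trained? (E < E0)
            break
        candidates = net.fusion_candidates()
        if candidates:
            best = max(candidates, key=net.fitness)
            if net.fitness(best) > IF0:     # switch only when a candidate is fit
                net.fuse(best)              # model size reduction
    return net
```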
The candidate set of new models and maps was made by using the unit reduction method of unit fusion [6]. Unit fusion selects a pair of units in a layer and replaces them by a single unit. On replacement, the connection weights to the new unit are determined so that the map of the old network will be inherited by the new network.
Let units indexed i and j be fused to make a single unit i′. The weighted sum of the inputs from units i and j and the unity bias b, to the subsequent layer unit k, can be written as

S_k = w_ki h_i + w_kj h_j + w_kb = w_ki (h̄_i + δ_i) + w_kj (h̄_j + δ_j) + w_kb    (8)
Figure 4. BP training with Model Switching: after each BP epoch on the immediate network f_N, the switchable model-map candidate set C_MS is determined and the fitness index I_F(f_N, f_N_i) is evaluated for all f_N_i ∈ C_MS; if max_i I_F(f_N, f_N_i) > I_F0, the network is switched to f_N_k with k = argmax_i I_F(f_N, f_N_i), otherwise no switching occurs; training ends when E < E0.
where w, h, h̄ and δ are the connection weight, the unit response, the average unit response and the varying portion of the response, respectively. Generally, we can put

δ_j ≈ ρ_ij (σ_j / σ_i) δ_i    (9)

with σ and ρ_ij denoting the standard deviation of the unit output and the output similarity of the unit pair, respectively, both evaluated over all the training inputs. From Eqs. (8) and (9), we have

S_k ≈ (w_ki + ρ_ij (σ_j / σ_i) w_kj) h_i + w_kb + w_kj (h̄_j − ρ_ij (σ_j / σ_i) h̄_i)    (10)
implying that the connection weights should be changed as,
w′_ki = w_ki + ρ_ij (σ_j / σ_i) w_kj    (11)

and

w′_kb = w_kb + w_kj (h̄_j − ρ_ij (σ_j / σ_i) h̄_i)    (12)
where the primes denote the connection weights after the fusion.
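The compensation of Eqs. (11)-(12) is a simple weight transfer; a sketch for a single subsequent-layer unit k, where the statistics (ρ_ij, σ, h̄) are assumed to be precomputed over the training inputs:

```python
def fused_weights(w_ki, w_kj, w_kb, rho_ij, sigma_i, sigma_j, hbar_i, hbar_j):
    """Outgoing weights of unit k after fusing hidden units i and j."""
    s = rho_ij * sigma_j / sigma_i
    w_ki_new = w_ki + s * w_kj                       # Eq. (11)
    w_kb_new = w_kb + w_kj * (hbar_j - s * hbar_i)   # Eq. (12): bias compensation
    return w_ki_new, w_kb_new
```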
Since no bias unit is used in the hidden layer of the HCN, only the compensation in Eq. (11) will be used. As unit fusion can be applied to all unit pairs in the hidden layer, N(N − 1)/2 switching candidates exist. The one which is most fit will be selected by evaluating the fitness index I_F(f_N, f_N_ij).
The fitness of the new map will be a function of the degree of map inheritance and of the closeness, in the feature space, of the two kernels to be fused, so as to give priority to the fusion of kernels that are placed close together. For evaluating the degree of map inheritance, a measure named Map Distance will be used.
Definition 2 (Map Distance) The map distance between two mapping vector functions f_N1(x) and f_N2(x), trained with the training vector set {x_p}_{p=1}^{P}, is defined as

D(f_N1, f_N2) = (1/P) Σ_{p=1}^{P} ‖f_N1(x_p) − f_N2(x_p)‖    (13)

where P is the number of training pairs.
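Definition 2 translates directly into code; a sketch in which f1 and f2 are callables returning the networks' output vectors:

```python
import numpy as np

def map_distance(f1, f2, X):
    """Eq. (13): mean output discrepancy of two maps over the training inputs X."""
    return np.mean([np.linalg.norm(f1(x) - f2(x)) for x in X])
```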
The fitness of the candidates will be evaluated by the fitness index function

I_F(f_N, f_N_ij) = (1 − D(f_N, f_N_ij) / D_max) (1 − ‖m_i − m_j‖ / √L)    (14)

where f_N_ij, L and D_max denote the map obtained by fusing the i-th and j-th units, the dimension of the feature space, and the maximum possible map distance, respectively. It is assumed that all the feature elements are bounded to the [0, 1] domain. On actual evaluation of the map distance, the theorem approximating the map distance generated by the fusion of hidden layer units [7] was used.
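Under the product-form reconstruction of Eq. (14) given above, the fitness index can be sketched as:

```python
import numpy as np

def fitness_index(D, D_max, m_i, m_j, L):
    """Eq. (14) as reconstructed: large when the fused map inherits the old map
    (small D / D_max) and the fused kernels lie close together within the unit
    hypercube (small ||m_i - m_j|| / sqrt(L))."""
    return (1.0 - D / D_max) * (1.0 - np.linalg.norm(m_i - m_j) / np.sqrt(L))
```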
4. The automatic defect classifier system
(ADC) [8]
4.1. Defect classes
In this work, we will try to classify the physical defect classes that provide the most information for locating the cause of the defects. The physical defect classes dealt with in this work, and their common appearances, are listed in the following:
A. Foreign objects (FO)
This class includes defects in which external objects are found on the wafer. Defects of the FO class tend to appear as small, dark-colored regions, typically near-circular in shape.
B. Embedded foreign objects (EO)
This is the class of defects where one or more processed film layers have been stacked over a foreign object. EO class defects appear slightly larger and more irregularly shaped than those of the FO class, because the patterns of the heaped area in the covering layers are deformed by the embedded object. In addition to the characteristic dark color of the particle itself, other colors can be observed as well. Defects of the FO and EO classes can appear quite similar, and are sometimes hard to distinguish even for an expert.

Figure 5. The flow of data in the ADC system: from the defect image and a defectless reference image, the defect mask is made; shape feature extraction and color quantization then yield the shape features and color ratios that are fed to the HCN classifier, which outputs the defect class.
C. Pattern failure (PF)
This class covers all kinds of defects that show pattern deformations without the presence of external objects. Defects of the PF class can also be caused by insufficient exposure or etching; thus they can have a wide variety of sizes and shapes. Since the defect is usually an extra region or a missing region in the pattern of a layer, the color of the defect region tends to be one of those observed in the normal patterns.
4.2. Feature extraction
A. Shape features
The flow of data in the ADC system is shown in Fig. 5. After subtraction of the defectless reference pattern from the defect image and further graylevel thresholding, the defect mask is made. From the defect mask, the shape features of defect size and roundness are calculated.
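A minimal sketch of this step follows; the threshold value and the roundness formula (4πA/P², a common definition) are our assumptions, as the paper does not spell them out.

```python
import numpy as np

def shape_features(defect_img, reference_img, thresh=30):
    """Defect mask by reference subtraction + thresholding, then size and roundness."""
    mask = np.abs(defect_img.astype(float) - reference_img.astype(float)) > thresh
    area = int(mask.sum())                      # defect size in pixels
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int((mask & ~interior).sum())   # boundary pixel count
    roundness = 4.0 * np.pi * area / perimeter ** 2 if perimeter else 0.0
    return area, roundness
```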
B. Color features
The color of the defect region is characterized by quantiz-
ing the color of each pixel to one of the prototype colors.
The prototype colors are determined beforehand by applying the Median Cut Algorithm [5] to defectless images of the layer to be inspected.

Figure 6. Artificially generated cluster data of four classes, with features x1, x2 in the unit square. (a) Training set (P = 100). (b) Test set (P = 1000).

Also, typical defect colors
are manually added as prototype colors. The ratios of the
quantized colors in the defect region were used as the color
feature vector of the defect.
In the experiments in Sec. 5, the feature dimension was
12, including the 2 shape features and 10 color features, all
normalized to unity range.
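The color feature is a nearest-prototype quantization histogram over the defect region; a sketch, with the pixel array and prototype list as placeholders:

```python
import numpy as np

def color_ratio_features(defect_pixels, prototypes):
    """Quantize each defect-region pixel to its nearest prototype color and
    return the ratio of each prototype color within the region.

    defect_pixels : (M, 3) RGB values of pixels inside the defect mask
    prototypes    : (Q, 3) prototype colors (Median Cut + manual additions)
    """
    d = np.linalg.norm(defect_pixels[:, None, :].astype(float)
                       - np.asarray(prototypes, dtype=float)[None, :, :], axis=2)
    nearest = np.argmin(d, axis=1)               # quantize each pixel
    counts = np.bincount(nearest, minlength=len(prototypes))
    return counts / counts.sum()                 # color ratio vector
```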
5. Experiment
A. Membership thresholding on artificial cluster data
The effect of membership thresholding and MS was evaluated using the artificial four-class data in a 2D domain shown in Fig. 6. Three types of networks and training strategies were tried. All networks were trained to the target error of E_0 = 0.01, to respond with class-specific unit vectors.
1. MLP with (input-hidden-output) = (2-4-4) units.
2. HCN with (input-hidden-output) = (2-100-4) units.
3. HCN trained by BP with MS for model reduction during training. Initial model: (2-100-4).
The change in the recognition rate for the test set, and the ratio of the area within the input domain which was pointed out as being of unknown class, were evaluated by changing the membership threshold θ in Eq. (7). Ideally, the recognition rate will remain high even when a large portion of the input domain is judged as unknown (rejected). The result is shown in Fig. 7. It is clear that by reducing the model of the HCN by MS, a larger portion of the input domain is properly rejected without losing classification ability on the test set.
Figure 7. The change in the recognition rate and the ratio of the rejected input domain when the membership threshold is changed from θ = 0.2 to θ = 0.9, for the MLP, the HCN, and the HCN with MS (recognition rate 0.7-1.0 on the vertical axis; ratio of rejected input domain 0-0.6 on the horizontal axis).
Table 1. The classification rate and the confusion matrix for the HCN evaluated by the leave-one-out method. In each pair a/b, the second number (bold typeface in the original) is for the case when membership thresholding was used.

True class \ Estimation      FO       EO       PF     Unknown   Correct (%)   Error (%)
Foreign Object (FO)        32/32     1/0      0/0      0/1      97.0/97.0     3.0/0.0
Embedded Object (EO)        2/0     32/30     2/1      0/5      88.9/83.3    11.1/2.8
Pattern Failure (PF)        0/0      2/0     22/21     0/3      91.7/87.5     8.3/0.0
Average rate (weighted)                                         92.5/89.2     7.5/1.1
B. Leave-one-out evaluation with HCN using MS
A collection of defect images obtained from the same process layer of a product was used for evaluating the ADC system. The set consisted of 33 FO class, 36 EO class and 24 PF class images. The class information for all the images was provided by an expert inspector. The classification rates were evaluated by the leave-one-out method [3]. An HCN with a unit configuration of (12-93-3), initialized by placing one kernel at each training input, was trained using MS. The model typically converged to reduced models with 9 to 14 hidden layer units.
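The evaluation protocol retrains on all samples but one and tests on the held-out sample; a generic sketch, with train and classify standing in for HCN training with MS and the Eq. (7) interpretation:

```python
def leave_one_out(X, labels, train, classify):
    """Leave-one-out classification rate [3] over a sample set."""
    n, correct = len(X), 0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        model = train([X[j] for j in keep], [labels[j] for j in keep])
        correct += (classify(model, X[i]) == labels[i])
    return correct / n
```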
The results are shown in Table 1. By employing the membership thresholding, it is found that the non-diagonal elements (errors) in the confusion matrix could be reduced drastically. The obtained classification rate is considered to be comparable to those of human experts. By reducing the network model by MS, the computation required for using the network was also reduced by 85-90% compared with the initial network model.
6. Conclusion
An ADC system for visual inspection of semiconductor wafers, using a neural network classifier, was introduced. The Hyperellipsoid Clustering Network was presented, and a training rule with cost terms for recognizing unfamiliar inputs as originating from an unknown defect class was given. Further, by using BP training with Model Switching, a reduced-model classifier which enables efficient classification was obtained. The defect classes and the descriptions of the extracted image features were given. In the experiments, the effectiveness of the unfamiliar input recognition was confirmed, and a classification rate comparable to those of human experts was obtained.
References
[1] P. B. Chou, A. R. Rao, M. C. Struzenbecker, F. Y. Wu, and
V. H. Brecher. Automatic defect classification for semicon-
ductor manufacturing. Machine Vision and Applications,
9(4):201–214, 1997.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene
Analysis. Wiley, 1973.
[3] K. Fukunaga. Introduction to Statistical Pattern Recogni-
tion. Academic Press, 1990.
[4] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.
[5] P. Heckbert. Color image quantization for frame buffer dis-
play. Computer Graphics, 16(3):297–307, 1982.
[6] K. Kameyama and Y. Kosugi. Neural network pruning
by fusing hidden layer units. Transactions of IEICE,
E74(12):4198–4204, 1991.
[7] K. Kameyama and Y. Kosugi. Model switching by chan-
nel fusion for network pruning and efficient feature extrac-
tion. Proceedings of International Joint Conference on Neu-
ral Networks 1998, pages 1861–1866, 1998.
[8] K. Kameyama, Y. Kosugi, T. Okahashi, and M. Izumita. Au-
tomatic defect classification in visual inspection of semicon-
ductors using neural networks. IEICE Transactions on In-
formation and Systems, E81-D(11):1261–1271, 1998.
[9] T. Kohonen. Self-organization and associative memory.
Springer, 1988.
[10] J. E. Moody and C. J. Darken. Fast learning in networks of
locally-tuned processing units. Neural Computation, 1:281–
294, 1989.
[11] E. Parzen. On estimation of a probability density function
and mode. Annals of Mathematical Statistics, 33:1065–
1076, 1962.
[12] T. Poggio and F. Girosi. Networks for approximation and
learning. Proceedings of the IEEE, 78:1481–1497, 1990.
[13] R. Reed. Pruning algorithms: a survey. IEEE Trans. Neural Networks, 4(5):740–747, 1993.
[14] D. Rumelhart, J. L. McClelland, and the PDP Research
Group. Parallel distributed processing. MIT Press, 1986.
[15] V. N. Vapnik. Statistical Learning Theory. Wiley, 1999.