
Face segmentation using skin-color map in videophone applications

July 1999
IEEE Transactions on Circuits and Systems for Video Technology 9(4):551 - 564


DOI:10.1109/76.767122

Source
IEEE Xplore

Authors:

Douglas Chai

Edith Cowan University

King Ngi Ngan

The Chinese University of Hong Kong



IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 4, JUNE 1999 551

Transactions Papers

Face Segmentation Using Skin-Color Map in Videophone Applications

Douglas Chai, Student Member, IEEE, and King N. Ngan, Senior Member, IEEE

Abstract— This paper addresses our proposed method to automatically segment out a person’s face from a given image that consists of a head-and-shoulders view of the person and a complex background scene. The method involves a fast, reliable, and effective algorithm that exploits the spatial distribution characteristics of human skin color. A universal skin-color map is derived and used on the chrominance component of the input image to detect pixels with skin-color appearance. Then, based on the spatial distribution of the detected skin-color pixels and their corresponding luminance values, the algorithm employs a set of novel regularization processes to reinforce regions of skin-color pixels that are more likely to belong to the facial regions and eliminate those that are not. The performance of the face-segmentation algorithm is illustrated by some simulation results carried out on various head-and-shoulders test images. The use of face segmentation for video coding in applications such as videotelephony is then presented. We explain how the face-segmentation results can be used to improve the perceptual quality of a videophone sequence encoded by the H.261-compliant coder.

Index Terms— Color image processing, face location, facial image analysis, H.261, image segmentation, quantization, video coding, videophone communication.

I. INTRODUCTION

The task of finding a person’s face in a picture seems

to be effortless for a human to perform. However, it is

far from simple for a machine of current technology to do

the same. In fact, development of such a machine or system

has been widely and actively studied in the ﬁeld of image

understanding for the past few decades with applications such

as machine vision and face recognition in mind. Moreover, in

recent years, the research activities in this area have intensiﬁed

as a result of its applications being extended toward video

representation and coding purposes.

The main objective of this research is to design a system that

can ﬁnd a person’s face from given image data. This problem is

commonly referred to as face location, face extraction, or face

segmentation. Regardless of the terminology, they all share

Manuscript received August 17, 1997; revised September 3, 1998. This

paper was recommended by Associate Editor S. Panchanathan.

The authors are with the Visual Communications Research Group, De-

partment of Electrical and Electronic Engineering, University of Western

Australia, Nedlands, Perth 6907 Australia.

Publisher Item Identiﬁer S 1051-8215(99)04160-9.

the same objective. However, note that the problem usually

deals with ﬁnding the position and contour of a person’s face

since its location is unknown, but given the knowledge of its

existence. If this is not known, then there is also a need to

discriminate between “images containing faces” and “images

not containing faces.” This is known as face detection. This

paper, however, focuses on face segmentation.

The signiﬁcance of this problem can be illustrated by its

vast applications, as face segmentation holds an important key

to future advances in human-to-human and human-to-machine

communications. The segmentation of a facial region provides

a content-based representation of the image where it can

be used for encoding, manipulation, enhancement, indexing,

modeling, pattern-recognition, and object-tracking purposes.

Some major applications include the following.

• Coding area of interest with better quality: The subjective quality of a very low-bit-rate encoded videophone sequence can be improved by coding the facial image region that is of interest to viewers at higher quality [1], [2].

• Content-based representation and MPEG-4: Face segmentation is a useful tool for the MPEG-4 content-based functionality. It provides content-based representation of the image, which can subsequently be used for coding, editing, or other interactivity purposes.

• Three-dimensional (3-D) human face model fitting: The delimitation of the person’s face is the fundamental requirement of 3-D human face model fitting used in model-based coding [3], computer animation, and morphing.

• Image enhancement: Face segmentation information can be used in a postprocessing task for enhancing images, such as the automatic adjustment of tint in the facial region.

• Face recognition: Finding the person’s face is the first important step in human face recognition, classification, and identification systems.

• Face tracking: Face location can be used to design a video camera system that tracks a person’s face in a room. It can be used as part of an intelligent vision system or simply in video surveillance.

1051–8215/99$10.00 1999 IEEE


Although the research on face segmentation has been pur-

sued at a feverish pace, there are still many problems yet to

be fully and convincingly solved as the level of difﬁculty of

the problem depends highly on the complexity level of the

image content and its application. Many existing methods only

work well on simple input images with a benign background

and frontal view of the person’s face. To cope with more

complicated images and conditions, many more assumptions

will then have to be made. Many of the approaches proposed

over the years involved the combination of shape, motion, and

statistical analysis [4]–[13]. In recent times, however, a new

approach of using color information has been introduced.

In this paper, we will discuss the color analysis approach to

face segmentation. The discussion includes the derivation of

a universal model of human skin color, the use of appropriate

color space, and the limitations of color segmentation. We then

present a practical solution to the face-segmentation problem.

This includes how to derive a robust skin-color reference map

and how to overcome the limitations of color segmentation. In

addition to face segmentation, one of its applications on video

coding will be presented in further detail. It will explain how

the face-segmentation results can be exploited by an existing

video coder so that it encodes the area of interest (i.e., the

facial region) with higher ﬁdelity and hence produces images

with better rendered facial features.

This paper is organized as follows. The color analysis

approach to face segmentation is presented in Section II.

In Section III, we present our contributions to this ﬁeld of

research, which include our proposed skin-color reference

map and methodology for face segmentation. The simulation results of our proposed algorithm, along with some discussion, are provided in Section IV. This is followed by Section V,

which describes a video coding technique that uses the face-

segmentation results. The conclusions and further research

directions are presented in Section VI.

II. COLOR ANALYSIS

The use of color information has been introduced to the

face-locating problem in recent years, and it has gained

increasing attention since then. Some recent publications that

have reported this study include [14]–[23]. They have all

shown, in one way or another, that color is a powerful descrip-

tor that has practical use in the extraction of face location.

The color information is typically used for region rather than

edge segmentation. We classify the region segmentation into

two general approaches, as illustrated in Fig. 1. One approach

is to employ color as a feature for partitioning an image into a

set of homogeneous regions. For instance, the color component

of the image can be used in the region growing technique, as

demonstrated in [24], or as a basis for a simple thresholding

technique, as shown in [23]. The other approach, however,

makes use of color as a feature for identifying a speciﬁc object

in an image. In this case, the skin color can be used to identify

the human face. This is feasible because human faces have a

special color distribution that differs signiﬁcantly (although

not entirely) from those of the background objects. Hence

this approach requires a color map that models the skin-color

distribution characteristics.

Fig. 1. The use of color information for region segmentation.

Fig. 2. Foreman image with a white contour highlighting the facial region.

The skin-color map can be derived in two ways, on account of the fact that not all faces have identical color features. One

approach is to predeﬁne or manually obtain the map such that

it suits only an individual color feature. For example, here we

obtain the skin-color feature of the subject in a standard head-

and-shoulders test image called Foreman. Although this is a

color image in YCrCb format, its gray-scale version is shown

in Fig. 2. The ﬁgure also shows a white contour highlighting

the facial region. The histograms of the color information (i.e.,

Cr and Cb values) bounded within this contour are obtained

as shown in Fig. 3. The diagrams show that the chrominance

values in the facial region are narrowly distributed, which

implies that the skin color is fairly uniform. Therefore, this

individual color feature can simply be deﬁned by the presence

of Cr values within, say, 136 and 156, and Cb values within

110 and 123. Using these ranges of values, we managed to

locate the subject’s face in another frame of Foreman and also

in a different scene (a standard test image called Carphone), as

can be seen in Fig. 4. This approach was suggested in the past

by Li and Forchheimer in [14]; however, a detailed procedure

on the modeling of individual color features and their choice

of color space was not disclosed.

In another approach, the skin-color map can be designed by applying a histogramming technique to a given set of training

CHAI AND NGAN: FACE SEGMENTATION USING SKIN-COLOR MAP 553

Fig. 3. Histograms of Cr and Cb components in the facial region.

Fig. 4. Foreman and Carphone images, and their color segmentation results,

obtained by using the same predeﬁned skin-color map.

data and subsequently used as a reference for any human face.

Such a method was successfully adopted by the authors [21],

[25], Sobottka and Pitas [18], and Cornall and Pang [22].

Among the two approaches, the ﬁrst is likely to produce

better segmentation results in terms of reliability and accuracy

by virtue of using a precise map. However, this comes at the expense of a face-segmentation process that either is too restrictive, because it uses a predefined map, or requires human interaction to manually define the necessary map. Therefore, the second approach is more practical and

appealing, as it attempts to cater to all personal color features

in an automatic manner, albeit in a less precise way. This,

however, raises a very important issue regarding the coverage

of all human races with one reference map. In addition, the

general use of a skin-color model for region segmentation

prompts two other questions, namely, which color space to

use and how to distinguish other parts of the body and

background objects with skin-color appearance from the actual

facial region.

A. Color Space

An image can be presented in a number of different color

space models.

• RGB: This stands for the three primary colors: red, green, and blue. It is a hardware-oriented model and is well known for its color-monitor display purpose.

• HSV: An acronym for hue-saturation-value. Hue is a color attribute that describes a pure color, while saturation defines the relative purity or the amount of white light mixed with a hue; value refers to the brightness of the image. This model is commonly used for image analysis.

• YCrCb: This is yet another hardware-oriented model. However, unlike the RGB space, here the luminance is separated from the chrominance data. The Y value represents the luminance (or brightness) component, while the Cr and Cb values, also known as the color difference signals, represent the chrominance component of the image.

These are some, but certainly not all, of the color space models

available in image processing. Therefore, it is important to

choose the appropriate color space for modeling human skin

color. The factors that need to be considered are application

and effectiveness. The intended purpose of the face segmen-

tation will usually determine which color space to use; at the

same time, it is essential that an effective and robust skin-

color model can be derived from the given color space. For

instance, in this paper, we propose the use of the YCrCb color

space, and the reason is twofold. First, an effective use of the

chrominance information for modeling human skin color can

be achieved in this color space. Second, this format is typically

used in video coding, and therefore the use of the same,

instead of another, format for segmentation will avoid the

extra computation required in conversion. On the other hand,

both Sobottka and Pitas [18] and Saxe and Foulds [19] have


opted for the HSV color space, as it is compatible with human

color perception, and the hue and saturation components have

been reported also to be sufﬁcient for discriminating color

information for modeling skin color. However, this color space

is not suitable for video coding. Hunke and Waibel [15] and

Graf et al. [26] used a normalized RGB color space. The

normalization was employed to minimize the dependence on

the luminance values.
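The effect of the normalization mentioned above can be sketched in a few lines. The snippet assumes the common definition in which each channel is divided by the sum R + G + B; the cited papers may use a different form, and the function name `normalized_rgb` is hypothetical. It illustrates that scaling the intensity of a pixel leaves its normalized coordinates unchanged.

```python
import numpy as np

def normalized_rgb(rgb):
    """Divide each channel by R + G + B (assumed definition of normalized RGB)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    # Assumption: black pixels (sum of zero) are left at zero instead of dividing by zero.
    total[total == 0] = 1.0
    return rgb / total

# The same surface color at two brightness levels maps to the same point.
bright = normalized_rgb([[200, 120, 80]])
dim = normalized_rgb([[100, 60, 40]])
print(np.allclose(bright, dim))
```

Since the three normalized components sum to one, two of them are sufficient to describe the color of a pixel.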

On this note, it is interesting to point out that unlike

the YCrCb and HSV color spaces, whereby the brightness

component is decoupled from the color information of the

image, in the RGB color space it is not. Therefore, Graf et al.

have suggested preprocessing calibration in order to cope with

unknown lighting conditions. From this point of view, the skin-

color model derived from the RGB color space will be inferior

to those obtained from the YCrCb or HSV color spaces. Based

on the same reasoning, we hypothesize that a skin-color model

can remain effective regardless of the variation of skin color

(e.g., black, white, or yellow) if the derivation of the model is

independent of the brightness information of the image. This

will be discussed in later sections.

B. Limitations of Color Segmentation

A simple region segmentation based on the skin-color map

can provide accurate and reliable results if there is a good

contrast between skin color and those of the background

objects. However, if the color characteristic of the background

is similar to that of the skin, then pinpointing the exact face

location is more difﬁcult, as there will be more falsely detected

background regions with skin-color appearance. Note that in

the context of face segmentation, other parts of the body are

also considered as background objects. There are a number of

methods to discriminate between the face and the background

objects, including the use of other cues such as motion and

shape.

Provided that the temporal information is available and

there is a priori knowledge of a stationary background and

no camera motion, motion analysis can be incorporated into

the face-localization system to identify nonmoving skin-color

regions as background objects. Alternatively, shape analysis

involving ellipse ﬁtting can also be employed to identify the

facial region from among the detected skin-color regions. It is

a common observation that the appearance of a human face

resembles an oval shape, and therefore it can be approximated

by an ellipse [2]. In this paper, however, we propose a set of

regularization processes that are based on the spatial distribu-

tion and the corresponding luminance values of the detected

skin-color pixels. This approach overcomes the restriction of

motion analysis and avoids the extensive computation of the

ellipse-ﬁtting method. The details will be discussed in the next

section along with our proposed method for face segmentation.

In addition to poor color contrast, there are other limitations

of color segmentation when an input image is taken in some

particular lighting conditions. The color process will encounter

some difﬁculty when the input image has:

• a “bright spot” on the subject’s face due to reﬂection of

intense lighting;

• a dark shadow on the face as a result of the use of strong

directional lighting that has partially blackened the facial

region;

• been captured with the use of color ﬁlters.

Note that these types of images (particularly the first two cases) pose great technical challenges not only to the color

segmentation approach but also to a wide range of other face-

segmentation approaches, especially those that utilize edge

image, intensity image, or facial feature-points extraction.

However, we have found that the color analysis approach

is immune to moderate illumination changes and shading

resulting from a slightly unbalanced light source, as these

conditions do not alter the chrominance characteristics of the

skin-color model.

III. FACE-SEGMENTATION ALGORITHM

In this section, we present our methodology to perform

face segmentation. Our proposed approach is automatic in

the sense that it uses an unsupervised segmentation algorithm,

and hence no manual adjustment of any design parameter is

needed in order to suit any particular input image. Moreover,

the algorithm can be implemented in real time, and its un-

derlying assumptions are minimal. In fact, the only principal

assumption is that the person’s face must be present in the

given image, since we are locating and not detecting whether

there is a face. Thus, the input information required by the

algorithm is a single color image that consists of a head-and-

shoulders view of the person and a background scene, and the

facial region can be as small as only a 32 × 32 pixel window (or 1%) of a CIF-size (352 × 288) input image. The format of

the input image is to follow the YCrCb color space, based on

the reason given in the previous section. The spatial sampling

frequency ratio of Y, Cr, and Cb is 4:1:1. So, for a CIF-size

image, Y has 288 lines and 352 pixels per line, while both Cr

and Cb have 144 lines and 176 pixels per line each.

The algorithm consists of ﬁve operating stages, as outlined

in Fig. 5. It begins by employing a low-level process like color

segmentation in the ﬁrst stage, then uses higher level opera-

tions that involve some heuristic knowledge about the local

connectivity of the skin-color pixels in the later stages. Thus,

each stage makes full use of the result yielded by its preceding

stage in order to reﬁne the output result. Consequently, all

the stages must be carried out progressively according to the

given sequence.

A detailed description of each stage is presented below.

For illustration purposes, we will use a studio-based head-

and-shoulders image called Miss America to present the in-

termediate results obtained from each stage of the algorithm.

This input image is shown in Fig. 6.

A. Stage One—Color Segmentation

The ﬁrst stage of the algorithm involves the use of color

information in a fast, low-level region segmentation process.

The aim is to classify pixels of the input image into skin color

and non-skin color. To do so, we have devised a skin-color

reference map in YCrCb color space.


Fig. 5. Outline of face-segmentation algorithm.

Fig. 6. Input image of Miss America.

We have found that a skin-color region can be identiﬁed

by the presence of a certain set of chrominance (i.e., Cr and

Cb) values narrowly and consistently distributed in the YCrCb

color space. The location of these chrominance values has

been found and can be illustrated using the CIE chromaticity

diagram as shown in Fig. 7. We denote $R_{Cr}$ and $R_{Cb}$ as the respective ranges of Cr and Cb values that correspond to skin color, which subsequently define our skin-color reference map. The ranges that we found to be the most suitable for all the input images that we have tested are $R_{Cr} = [133, 173]$ and $R_{Cb} = [77, 127]$. This map has been proven, in our

experiments, to be very robust against different types of skin

color. Our conjecture is that the different skin color that we

perceived from the video image cannot be differentiated from

the chrominance information of that image region. So, a map

that is derived from Cr and Cb chrominance values will remain

effective regardless of skin-color variation (see Section IV for

the experimental results). Moreover, our intuitive justiﬁcation

for the manifestation of similar Cr and Cb distributions of

Fig. 7. Skin-color region in CIE chromaticity diagram.

skin color of all races is that the apparent difference in skin

color that viewers perceived is mainly due to the darkness or

fairness of the skin; these features are characterized by the

difference in the brightness of the color, which is governed by

Y but not Cr and Cb.

With this skin-color reference map, the color segmenta-

tion can now begin. Since we are utilizing only the color

information, the segmentation requires only the chrominance

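component of the input image. Consider an input image of $M \times N$ pixels, for which the dimension of Cr and Cb therefore is $M/2 \times N/2$. The output of the color segmentation, and hence stage one of the algorithm, is a bitmap of $M/2 \times N/2$ size, described as

$$O_1(x, y) = \begin{cases} 1, & \text{if } [Cr(x, y) \in R_{Cr}] \cap [Cb(x, y) \in R_{Cb}] \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $x = 0, \ldots, M/2 - 1$ and $y = 0, \ldots, N/2 - 1$. The output pixel at point $(x, y)$ is classified as skin color and set to one if both the Cr and Cb values at that point fall inside their respective ranges $R_{Cr}$ and $R_{Cb}$. Otherwise, the pixel is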

classiﬁed as non-skin color and set to zero. To illustrate this,

we perform color segmentation on the input image of Miss

America, and the bitmap produced can be seen in Fig. 8. The

output value of one is shown in black, while the value of zero

is shown in white (this convention will be used throughout

this paper).
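The classification rule of (1) can be sketched with NumPy as follows. This is a minimal illustration and not the authors' implementation: the function name `color_segmentation` is hypothetical, the default ranges (Cr in [133, 173], Cb in [77, 127]) are the values usually quoted for this map and are treated here as adjustable parameters, and arrays are indexed as [row, column].

```python
import numpy as np

# Assumed default ranges for the skin-color reference map (inclusive bounds).
R_CR = (133, 173)
R_CB = (77, 127)

def color_segmentation(cr, cb, r_cr=R_CR, r_cb=R_CB):
    """Return a bitmap that is 1 where both Cr and Cb fall inside their ranges."""
    cr = np.asarray(cr)
    cb = np.asarray(cb)
    in_cr = (cr >= r_cr[0]) & (cr <= r_cr[1])
    in_cb = (cb >= r_cb[0]) & (cb <= r_cb[1])
    return (in_cr & in_cb).astype(np.uint8)

# Tiny 2 x 2 chrominance planes: only pixels inside both ranges are marked.
cr = np.array([[150, 150], [100, 150]])
cb = np.array([[100, 200], [100, 127]])
print(color_segmentation(cr, cb))
```

Because only two comparisons per chrominance sample are needed, the stage is cheap enough to run on every frame.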

Among all the stages, this ﬁrst stage is the most vital. Based

on our model of human skin color, the color segmentation

has to remove as many pixels as possible that are unlikely to

belong to the facial region while catering for a wide variety of

skin color. However, if it falsely removes too many pixels that

belong to the facial region, then the error will propagate down

the remaining stages of the algorithm, consequently causing a

failure to the entire algorithm.

Nevertheless, the result of color segmentation is the detec-

tion of pixels in a facial area and may also include other areas


Fig. 8. Bitmap produced by stage one.

where the chrominance values coincide with those of the skin

color (as is the case in Fig. 8). Hence the successive operating

stages of the algorithm are used to remove these unwanted

areas.

B. Stage Two—Density Regularization

This stage considers the bitmap produced by the previous

stage to contain the facial region that is corrupted by noise.

The noise may appear as small holes on the facial region

due to undetected facial features such as eyes and mouth, or

it may also appear as objects with skin-color appearance in

the background scene. Therefore, this stage performs simple

morphological operations such as dilation to ﬁll in any small

hole in the facial area and erosion to remove any small object

in the background area. The intention is not necessarily to

remove the noise entirely but to reduce its amount and size.

To distinguish between these two areas, we ﬁrst need to

identify regions of the bitmap that have higher probability

of being the facial region. The probability measure that we

used is derived from our observation that the facial color is

very uniform, and therefore the skin-color pixels belonging

to the facial region will appear in a large cluster, while the

skin-color pixels belonging to the background may appear as

large clusters or small isolated objects. Thus, we study the

density distribution of the skin-color pixels detected in stage

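one. An array of density values, called the density map $D(x, y)$, is computed as

$$D(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} O_1(4x + i, 4y + j) \tag{2}$$

where $x = 0, \ldots, M/8 - 1$ and $y = 0, \ldots, N/8 - 1$. It first partitions the output bitmap of stage one, $O_1(x, y)$, into nonoverlapping groups of 4 × 4 pixels, then counts the number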

of skin-color pixels within each group and assigns this value

to the corresponding point of the density map.
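The block counting described above can be sketched as a reshape followed by a sum. The function name `density_map` is hypothetical, and dropping any partial blocks at the right and bottom edges is an assumption (for a CIF-size input the chrominance planes divide evenly by four).

```python
import numpy as np

def density_map(o1):
    """Count skin-color pixels in each nonoverlapping 4 x 4 group of the stage-one bitmap."""
    o1 = np.asarray(o1, dtype=np.int32)
    rows, cols = o1.shape
    # Assumption: ignore trailing rows/columns that do not fill a whole 4 x 4 group.
    rows -= rows % 4
    cols -= cols % 4
    blocks = o1[:rows, :cols].reshape(rows // 4, 4, cols // 4, 4)
    return blocks.sum(axis=(1, 3))  # each value lies between 0 and 16

# One full group, two empty groups, and one group with a single detected pixel.
o1 = np.zeros((8, 8), dtype=np.uint8)
o1[:4, :4] = 1
o1[4, 4] = 1
print(density_map(o1))
```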

According to the density value, we classify each point into three types, namely, zero ($D = 0$), intermediate ($0 < D < 16$), and full ($D = 16$). A group of points with zero density

value will represent a nonfacial region, while a group of full-

density points will signify a cluster of skin-color pixels and a

high probability of belonging to a facial region. Any point

of intermediate density value will indicate the presence of

Fig. 9. Density map after classiﬁcation.

Fig. 10. Bitmap produced by stage two.

noise. The density map of Miss America with the three density

classiﬁcations is depicted in Fig. 9. The point of zero density is

shown in white, intermediate density in gray, and full density

in black.

Once the density map is derived, we can then begin the

process that we termed as density regularization. This involves

the following three steps.

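1) Discard all points at the edge of the density map, i.e., set $D(0, y) = D(M/8 - 1, y) = D(x, 0) = D(x, N/8 - 1) = 0$ for all $x = 0, \ldots, M/8 - 1$ and $y = 0, \ldots, N/8 - 1$.

2) Erode any full-density point (i.e., set to zero) if it is surrounded by fewer than five other full-density points in its local 3 × 3 neighborhood.

3) Dilate any point of either zero or intermediate density (i.e., set to 16) if there are more than two full-density points in its local 3 × 3 neighborhood.

After this process, the density map is converted to the output bitmap of stage two as

$$O_2(x, y) = \begin{cases} 1, & \text{if } D(x, y) = 16 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

for all $x = 0, \ldots, M/8 - 1$ and $y = 0, \ldots, N/8 - 1$.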
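The three regularization steps and the conversion in (3) can be sketched as below. The text does not state whether points are updated one at a time or all at once; this sketch assumes that each step is evaluated on the map as it stands at the start of that step. The helper names are hypothetical.

```python
import numpy as np

def count_full_neighbors(full):
    """For every point, count the full-density points among its 8 neighbors."""
    rows, cols = full.shape
    padded = np.pad(full.astype(np.int32), 1)  # zero border outside the map
    total = np.zeros((rows, cols), dtype=np.int32)
    for dr in (0, 1, 2):
        for dc in (0, 1, 2):
            if (dr, dc) != (1, 1):  # skip the center point itself
                total += padded[dr:dr + rows, dc:dc + cols]
    return total

def density_regularization(d):
    """Apply steps 1 to 3 to a density map and return the stage-two bitmap."""
    d = np.array(d, dtype=np.int32)
    # Step 1: discard all points at the edge of the density map.
    d[0, :] = 0
    d[-1, :] = 0
    d[:, 0] = 0
    d[:, -1] = 0
    # Step 2: erode full-density points with fewer than five full-density neighbors.
    full = d == 16
    d[full & (count_full_neighbors(full) < 5)] = 0
    # Step 3: dilate zero or intermediate points with more than two full-density neighbors.
    full = d == 16
    d[~full & (count_full_neighbors(full) > 2)] = 16
    # Conversion to a bitmap: one where the density is full, zero elsewhere.
    return (d == 16).astype(np.uint8)

# A 5 x 5 full-density cluster with an intermediate "hole" in the middle.
d = np.zeros((9, 9), dtype=np.int32)
d[2:7, 2:7] = 16
d[4, 4] = 5
out = density_regularization(d)
print(out[4, 4])  # the hole inside the cluster is filled
```

Under the simultaneous-update assumption, the corners of the cluster are eroded in step 2 and restored in step 3, so the result depends on the update order chosen.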

The result of stage two for the Miss America image is

displayed in Fig. 10. Note that this bitmap is now four times

lower in spatial resolution than that of the output bitmap in

stage one.


Fig. 11. Standard deviation values of the detected pixels in $O_2(x, y)$.

C. Stage Three—Luminance Regularization

We have found that in a typical videophone image, the

brightness is nonuniform throughout the facial region, while

the background region tends to have a more even distribution

of brightness. Hence, based on this characteristic, background

region that was previously detected due to its skin-color

appearance can be further eliminated.

The analysis employed in this stage involves the spatial

distribution characteristic of the luminance values since they

deﬁne the brightness of the image. We use standard deviation

as the statistical measure of the distribution. Note that the size of the previously obtained bitmap $O_2(x, y)$ is $M/8 \times N/8$; hence each point corresponds to a group of 8 × 8 luminance values, denoted by $W$, in the original input image. For every skin-color pixel in $O_2(x, y)$, we calculate the standard deviation, denoted as $\sigma(x, y)$, of its corresponding group of luminance values, using

$$\sigma(x, y) = \sqrt{E[W^2] - (E[W])^2} \tag{4}$$

Fig. 11 depicts the standard deviation values calculated for

the Miss America image.

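If the standard deviation is below a value of two, then the corresponding 8 × 8 pixel region is considered too uniform and therefore unlikely to be part of the facial region. As a result, the output bitmap of stage three, denoted as $O_3(x, y)$, is derived as

$$O_3(x, y) = \begin{cases} 1, & \text{if } O_2(x, y) = 1 \text{ and } \sigma(x, y) \ge 2 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

for all $x = 0, \ldots, M/8 - 1$ and $y = 0, \ldots, N/8 - 1$. The output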

bitmap of this stage for the Miss America image is presented

in Fig. 12. The ﬁgure shows that a signiﬁcant portion of the

unwanted background region was eliminated at this stage.
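A minimal sketch of this stage is given below, assuming the standard deviation is the population standard deviation of each 8 × 8 luminance group and that the luminance plane is exactly eight times larger than the stage-two bitmap in each dimension. The function name `luminance_regularization` is hypothetical.

```python
import numpy as np

def luminance_regularization(o2, y, threshold=2.0):
    """Keep stage-two points whose 8 x 8 luminance group is not too uniform."""
    o2 = np.asarray(o2)
    y = np.asarray(y, dtype=np.float64)
    rows, cols = o2.shape
    # Each point of the bitmap corresponds to one 8 x 8 group of luminance values.
    blocks = y[:rows * 8, :cols * 8].reshape(rows, 8, cols, 8)
    mean = blocks.mean(axis=(1, 3))
    mean_sq = (blocks ** 2).mean(axis=(1, 3))
    # sqrt(E[W^2] - (E[W])^2); clip tiny negative values caused by rounding.
    sigma = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    o3 = ((o2 == 1) & (sigma >= threshold)).astype(np.uint8)
    return o3, sigma

# Left group: perfectly flat luminance. Right group: checkerboard of 90 and 110.
y = np.full((8, 16), 100.0)
checker = np.indices((8, 8)).sum(axis=0) % 2 == 0
y[:, 8:] = np.where(checker, 90.0, 110.0)
o3, sigma = luminance_regularization(np.ones((1, 2), dtype=np.uint8), y)
print(o3, sigma)
```

The standard deviation is computed for every group here for simplicity; an implementation concerned with speed could restrict it to the detected points only.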

Fig. 12. Bitmap produced by stage three.

D. Stage Four—Geometric Correction

We performed a horizontal and vertical scanning process to

identify the presence of any odd structure in the previously obtained bitmap, $O_3(x, y)$, and subsequently removed it. This is to ensure that a correct geometric shape of the facial region is obtained. However, prior to the scanning process, we will attempt to remove any remaining noise by using a technique similar to that initially introduced in stage two. Therefore, a pixel in $O_3(x, y)$ with the value of one will remain as a detected pixel if there are more than three other pixels, in its local 3 × 3 neighborhood, with the same value. At the same time, a pixel in $O_3(x, y)$ with a value of zero will be reconverted to a value of one (i.e., as a potential pixel of the facial region) if it is surrounded by more than five pixels, in its local 3 × 3 neighborhood, with a value of one. These simple

procedures will ensure that noise appearing on the facial region

is ﬁlled in and that isolated noise objects on the background

are removed.

We then commence the horizontal scanning process on the

“ﬁltered” bitmap. We search for any short continuous run of

pixels that are assigned with the value of one. For a CIF-

size image, the threshold for a group of connected pixels to


Fig. 13. Bitmap produced by stage four.

Fig. 14. Bitmap produced by stage ﬁve.

belong to the facial region is four. Therefore, any group of less

than four horizontally connected pixels with the value of one

will be eliminated and assigned to zero. A similar process is

then performed in the vertical direction. The rationale behind

this method is that, based on our observation, any such short

horizontal or vertical run of pixels with the value of one is

unlikely to be part of a reasonable-size and well-detected

facial region. As a result, the output bitmap of this stage

should contain the facial region with minimal or no noise,

as demonstrated in Fig. 13.
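A minimal sketch of the scanning step, assuming the bitmap is a list of rows of zeros and ones; the function names are hypothetical, and the threshold of four is the value quoted for a CIF-size image. Runs are removed horizontally first, and the vertical pass then operates on the horizontally cleaned bitmap.

```python
def remove_short_runs(line, min_run=4):
    """Zero out every run of ones shorter than min_run in a 1-D list."""
    out = list(line)
    i, n = 0, len(line)
    while i < n:
        if line[i] == 1:
            j = i
            while j < n and line[j] == 1:
                j += 1                      # j is one past the end of the run
            if j - i < min_run:
                for k in range(i, j):
                    out[k] = 0              # run too short: eliminate it
            i = j
        else:
            i += 1
    return out


def geometric_correction(bitmap, min_run=4):
    """Horizontal scan followed by a vertical scan on the result."""
    horizontal = [remove_short_runs(row, min_run) for row in bitmap]
    # Transpose, clean each column as if it were a row, then transpose back.
    columns = [remove_short_runs(list(col), min_run) for col in zip(*horizontal)]
    return [list(row) for row in zip(*columns)]
```

For example, a stray run of three pixels next to a 4 × 4 block is removed, while the block itself survives both passes.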

E. Stage Five—Contour Extraction

In this final stage, we convert the output bitmap of stage four back to the dimension of the stage-one output bitmap. To achieve the increase in spatial resolution, we utilize the edge information that is already made available by the color segmentation in stage one. Therefore, all the boundary points in the previous bitmap are mapped into the corresponding group of 4 × 4 pixels, with the value of each pixel as defined in the output bitmap of stage one. The representative output bitmap of this final stage of the algorithm is shown in Fig. 14.
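The mapping can be sketched as below. Several details are assumptions made only for illustration: the name `contour_extraction`, the definition of a boundary point (a one-valued point with a 4-connected neighbor that is zero or outside the bitmap), and the handling of non-boundary points, which are simply replicated into their 4 × 4 group.

```python
def contour_extraction(low, stage_one, scale=4):
    """Upsample `low` by `scale`, refining boundary points with `stage_one`.

    `low` is the coarse bitmap; `stage_one` is the finer bitmap whose
    dimensions are `scale` times those of `low`.
    """
    rows, cols = len(low), len(low[0])

    def is_boundary(r, c):
        # Assumed definition: a detected point touching a zero or the border.
        if low[r][c] != 1:
            return False
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if not (0 <= rr < rows and 0 <= cc < cols) or low[rr][cc] == 0:
                return True
        return False

    out = [[0] * (cols * scale) for _ in range(rows * scale)]
    for r in range(rows):
        for c in range(cols):
            boundary = is_boundary(r, c)
            for y in range(r * scale, (r + 1) * scale):
                for x in range(c * scale, (c + 1) * scale):
                    # Boundary groups take their pixels from stage one;
                    # all other groups replicate the coarse value.
                    out[y][x] = stage_one[y][x] if boundary else low[r][c]
    return out
```

Under these assumptions, only the groups along the edge of the facial region depend on the stage-one bitmap, which is where the finer edge information matters.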

IV. SEGMENTATION RESULTS

The proposed skin-color reference map is intended to work

on a wide range of skin color, including that of people of

European, Asian, and African descent. Therefore, to show that it works on subjects with skin color other than white (as is the

case with the Miss America image), we have used the same

map to perform the color-segmentation process on subjects

with black and yellow skin color. The results obtained were

very good, as can be seen in Fig. 15. The skin-color pixels

were correctly identiﬁed, in both input images, with only a

small amount of noise appearing, as expected, in the facial

regions and background scenes, which can be removed by the

remaining stages of the algorithm.

Fig. 15. Results produced by the color-segmentation process in stage one

and the ﬁnal output of the face segmentation algorithm.

We have further tested the skin-color map with 30 samples

of images. Skin colors were grouped into three classes: white,

yellow, and black. Ten samples, each of which contained the

facial region of a different subject captured in a different

lighting condition, were taken from each class to form the

test set. We have constructed three normalized histograms for

each sample in the separate Y, Cr, and Cb components. The

normalization process was used to account for the variation

of facial-region size in each sample. We have then taken the

average results from the ten samples of each class. These

average normalized histogram results are presented in Fig. 16.
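The normalization and averaging described above can be illustrated with a short sketch. It assumes 8-bit component values and a non-empty facial region per sample; the function names are hypothetical. Dividing each histogram by the number of pixels in the region makes samples with different facial-region sizes comparable before they are averaged.

```python
def normalized_histogram(values, bins=256):
    """Histogram of one component (Y, Cr, or Cb) scaled to sum to one."""
    hist = [0.0] * bins
    for v in values:
        hist[v] += 1.0
    total = len(values)          # assumed non-zero: the region is not empty
    return [h / total for h in hist]


def average_histograms(histograms):
    """Bin-wise mean of several normalized histograms."""
    count = len(histograms)
    return [sum(bin_values) / count for bin_values in zip(*histograms)]


# Two toy "facial regions" of different sizes contribute equally after
# normalization, regardless of how many pixels each contains.
small_region = [16, 16, 17, 17]
large_region = [16] * 10
average = average_histograms([
    normalized_histogram(small_region),
    normalized_histogram(large_region),
])
```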

Since all samples were taken from different and unknown

lighting conditions, the histograms of the Y component for all

three classes cannot be used to verify whether the variations

of luminance values in these image samples were caused by

the different skin color or by the different lighting condi-

tions. However, the use of such samples illustrated that the

variation in illumination does not seem to affect the skin-

color distribution in the Cr and Cb components. On the other

hand, the histograms of Cr and Cb components for all three

classes clearly showed that the chrominance values are indeed

narrowly distributed, and more important, that the distributions

are consistent across different classes. This demonstrated that

an effective skin-color reference map could be achieved based

on the Cr and Cb components of the input image.

The face-segmentation algorithm with this universal skin-

color reference map was tested on many head-and-shoulders

images. Here we emphasize that the face-segmentation process

was designed to be completely automatic, and therefore the

same design parameters and rules (including the reference

skin-color map and the heuristic) as described in the previous

section were applied to all the test images. The test set now

contained 20 images from each class of skin color. Therefore,

a total of 60 images of different subjects, background com-

plexities, and lighting conditions from the three classes were


Fig. 16. Histograms of Y, Cr, and Cb values of different facial skin colors: (a) white, (b) yellow, and (c) black.

used. Using this test set, a success rate of 82% was achieved.

The algorithm has performed successful segmentation of 49

out of 60 faces. Out of the 11 unsuccessful cases, seven cases

have incorrect localization, two have partial localization, and

two have both incorrect and partial localization.

The representative results shown in Fig. 17 illustrated the

successful face segmentation achieved by the algorithm on two

images with different background complexities. The edges of

the facial regions were accurately obtained, with no noise

appearing on either the facial region or the background.

Moreover, the results were obtained in real time, as it took

a SunSPARC 20 computer less than 1 s to perform all

computations required on a CIF-size input image.

In all seven incorrect localization cases, the segmentation re-

sults did contain the complete facial regions but also included

some background regions. In four out of seven, the subject’s

hair, which is considered as a background region, was falsely

identiﬁed as a facial region. Partial localization occurred in

two cases and resulted in the localization of an incomplete

facial region. These cases were caused by thick facial hair,

i.e., mustache and beard. The two cases with both incorrect

and partial localization have facial regions partially localized,

and the results also contained some background regions.

Note that in all cases, the facial regions were always located,

whether completely or partially.

V. CODING

Here, we describe a video coding technique, termed a

foreground/background (FB) coding scheme, that uses the

face-segmentation results to code the area of interest with

better quality. In applications such as videotelephony, the face

of the speaker is typically the most important image region for

the viewer. Therefore, the face-segmentation algorithm is used

to separate the facial area from its background scene to become


Fig. 17. Segmented facial regions and remaining background scenes.

the foreground region. Here, we propose to use the classical

block-based video coding system. To be consistent with many

of the video coding standards [27]–[30], the foreground and

background regions will only need to be identiﬁed at the

macroblock (MB) level.
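How the pixel-level segmentation is reduced to MB-level labels is not specified here. One plausible rule, assumed purely for illustration, marks a 16 × 16 MB as foreground when it contains at least one facial pixel. The sketch also assumes a mask at luminance resolution; the name `classify_macroblocks` is hypothetical.

```python
def classify_macroblocks(mask, mb_size=16):
    """Label each macroblock "FG" or "BG" from a binary face mask.

    Assumed rule: an MB is foreground if any pixel inside it is a face pixel.
    """
    rows, cols = len(mask), len(mask[0])
    labels = []
    for top in range(0, rows, mb_size):
        label_row = []
        for left in range(0, cols, mb_size):
            has_face = any(
                mask[y][x]
                for y in range(top, min(top + mb_size, rows))
                for x in range(left, min(left + mb_size, cols))
            )
            label_row.append("FG" if has_face else "BG")
        labels.append(label_row)
    return labels
```

A stricter rule, such as requiring a majority of facial pixels, would shrink the foreground and leave more bits for each foreground MB.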

In the FB encoding process, we allocate fewer bits for

encoding the background MB’s by using a higher quantization

level. In doing so, we free up more bits that can then be used

for encoding the foreground MB’s. This bit transfer leads to

a better quality encoded area of interest at the expense of

having a lower quality background image. This is based on

the premise that the background is usually of less signiﬁcance

to the viewer’s perception, so the overall subjective quality

of the image is perceptively improved and more pleasing to

the viewer.

This concept was initially proposed by us in [1], where we

introduced the FB coding scheme and its implementation as

an additional encoding option for the H.263 codec [30]. In this

paper, however, we will use the H.261 codec.

A. H.261FB

We have integrated the FB coding scheme into the well-

known H.261 video coding system [29]. Hereafter, we term

this approach H.261FB. The H.261FB coder utilizes the in-

formation obtained from the face-segmentation algorithm, as

described in Section III, to enable bit transfer between the

foreground and background MB’s. This redistribution of bit

allocation is simply attained by controlling the quantization

level in a discriminatory manner. In addition, a new rate-

control strategy is devised in order to regulate the bitstream

produced by this discriminatory quantization process.

This approach will still produce a bitstream that conforms

to the H.261 standard. The reason is that the new quantization

process does not involve any modiﬁcation to the bitstream

syntax; it merely assigns two different values to two different

regions. As for the rate control, there is no standardized

technique. Hence the manufacturers of the encoder have the

freedom to devise their own strategy. Moreover, we do not

need to transmit the segmentation information to the decoder,

as it is used in the encoder only. Therefore, the integra-

tion is supported by the syntax, and a full H.261 decoder

compatibility is maintained.

B. Discriminatory Quantization Process

Two quantizers, instead of one, are used in the H.261FB approach. We assigned $Q_f$ and $Q_b$ to be the quantizers for the foreground (FG) and background (BG) MB’s, respectively. Of the two, $Q_f$ is the finer quantizer, while $Q_b$ is the coarser one. H.261FB uses the MQUANT header to switch between these two quantizers, as shown in (6). The MQUANT header is a fixed-length code word of five bits that indicates the quantization level to be used for the current MB. Hence this 5-bit code word represents a range of quantization levels from 1 to 31

$$\text{MQUANT} = \begin{cases} Q_f, & \text{if current MB belongs to FG} \\ Q_b, & \text{if current MB belongs to BG.} \end{cases} \tag{6}$$

It is not necessary, however, for the encoder to send this

header for every MB. The transmission of the MQUANT

header is only required in one of the following cases:

1) when the current MB is in a different region from the

previously encoded MB, i.e., a change from foreground

to background MB or vice versa;

2) when the rate-control algorithm updates the quantization

level in order to maintain a constant bit rate.

Naturally, this approach has to sustain a slight increase in

the transmission of an MQUANT header. However, the beneﬁt

easily outweighs this overhead cost, as will be demonstrated

in the simulation results.
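The signalling rule can be sketched as follows. The sketch assumes that each MB arrives with its region label and the current foreground and background quantization levels produced by rate control, and that the first MB always carries a header; the name `plan_mquant` is hypothetical. A header is emitted whenever the level in force changes, which covers both a region change and a rate-control update.

```python
def plan_mquant(macroblocks):
    """Return (level, send_header) for each (region, q_fg, q_bg) tuple."""
    plan = []
    previous = None  # quantization level in force after the previous MB
    for region, q_fg, q_bg in macroblocks:
        # Selection rule of (6): finer level for FG, coarser level for BG.
        level = q_fg if region == "FG" else q_bg
        # A header is needed only when the level differs from the one in force.
        send_header = level != previous
        plan.append((level, send_header))
        previous = level
    return plan


sequence = [
    ("BG", 11, 31),
    ("BG", 11, 31),   # same region, same level: no header
    ("FG", 11, 31),   # region change: header
    ("FG", 12, 31),   # rate-control update: header
    ("BG", 12, 31),   # region change: header
]
plan = plan_mquant(sequence)
```

Because faces tend to occupy contiguous MB's, region changes are infrequent along a row of MB's, which is consistent with the overhead being described as slight.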

C. Rate-Control Function

A new rate-control strategy is needed to adjust not one

but now two quantizers periodically in order to regulate

the bit rate. To do so, the quantizer can be adjusted as

follows. The quantization parameter (or level) assigned to

the quantizer can be deﬁned as a simple function of buffer

contents. Mathematically, the quantization parameter QP can be expressed as

$$\mathrm{QP} = \left\lfloor \frac{\text{BufferContents}}{Q_{\mathrm{div}}} \right\rfloor + Q_{\mathrm{off}} \tag{7}$$

where $Q_{\mathrm{div}}$ is the quantization division factor of the buffer and $Q_{\mathrm{off}}$ is the offset factor. The BufferContents variable indicates how much data (in units of bits) is currently stored in the buffer.

According to the RM8 coder [31] (a reference implementation of the H.261 coder, developed by the standardization study group), $Q_{\mathrm{off}}$ is set to one to avoid zero quantization, while $Q_{\mathrm{div}}$ is equal to the target Bitrate divided by a constant value of 320, i.e.,

$$Q_{\mathrm{div}} = \frac{\text{Bitrate}}{320} \tag{8}$$

where Bitrate $= p \times 64$ kbits/s, $p = 1, 2, \ldots, 30$. Hence for the RM8 coder, the next quantization parameter is determined by the function described as

$$\mathrm{QP} = \min\left( \left\lfloor \frac{\text{BufferContents}}{200p} \right\rfloor + 1,\ 31 \right). \tag{9}$$

The value of QP is clipped at 31 because the MQUANT header

is a ﬁxed-length code word of ﬁve bits. As the BufferContents

increases, QP also increases in order to offset any rise in bit

rate. The value of QP will remain at the maximum of 31 until

the buffer is full, which takes place when the BufferContents

variable reaches the maximum capacity of the buffer. When

the BufferContents variable exceeds the buffer size, buffer

overﬂow is said to occur. In such an event, the macroblock

is skipped (i.e., not transmitted), and as a result, quantization

is no longer needed.
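As an illustration, the RM8 rule can be written as a small function. The sketch assumes the rate-control function has the form QP = min(⌊BufferContents / division factor⌋ + 1, 31), with a bit rate of p × 64 kbits/s so that the division factor is 64 000p / 320 = 200p; the name `rm8_qp` is hypothetical, and buffer overflow handling is left out.

```python
def rm8_qp(buffer_contents, p):
    """Quantization parameter for a buffer occupancy given in bits.

    `p` is the bit-rate multiplier: the target bit rate is p * 64 kbits/s.
    """
    q_div = 200 * p                       # (p * 64000) / 320
    qp = buffer_contents // q_div + 1     # offset of one avoids QP = 0
    return min(qp, 31)                    # MQUANT has only five bits


# At 192 kbits/s (p = 3) the division factor is 600 bits, so QP grows by one
# for every additional 600 bits in the buffer and saturates at 18 000 bits.
levels = [rm8_qp(b, 3) for b in (0, 599, 600, 17999, 18000, 10**6)]
```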

In the H.261FB approach, two similar rate-control functions as mentioned above are used, one for the foreground region and another for the background. Each function has its own values of the division factor and the offset factor. For instance, we can set the offset factor to a higher value such that the function forces the quantizer to always adopt a coarser quantization parameter. Therefore, the amount of bit transfer between foreground and background MB’s is mainly determined by the values assigned to their respective rate-control functions. On the other hand, the offset factor governs how the bits are distributed within the same region.

Here, we choose (9), the function defined in RM8, for the foreground region [see Fig. 18(a)]. As for the background region, we shift the offset factor to 15 and set the division factor to $(30/16) \times 200p$, with $p$ as in (8) [see Fig. 18(b)]. This constrains the quantizer to a minimum value of 15, while the clipping of the quantization level to its maximum value will occur at the same level of buffer occupancy as in the case of RM8.
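The claim that both functions clip at the same buffer occupancy can be checked numerically. The sketch assumes both functions have the form QP = min(⌊BufferContents / division factor⌋ + offset, 31), with a bit rate of p × 64 kbits/s, a foreground division factor of 200p with offset 1, and a background division factor of (30/16) × 200p with offset 15. The function names are hypothetical, and integer arithmetic is used to avoid rounding issues.

```python
def foreground_qp(buffer_contents, p):
    """RM8 function: division factor 200p, offset 1."""
    return min(buffer_contents // (200 * p) + 1, 31)


def background_qp(buffer_contents, p):
    """Background function: division factor (30/16) * 200p, offset 15.

    floor(B / ((30/16) * 200p)) is computed exactly as (16 * B) // (6000p).
    """
    return min((16 * buffer_contents) // (6000 * p) + 15, 31)


p = 3                         # 192 kbits/s
clip_point = 30 * 200 * p     # 18 000 bits: where QP first reaches 31
```

The foreground level reaches 31 after 30 steps of 200p bits and the background level after 16 steps of (30/16) × 200p bits, and both equal 6000p bits. Below that point the background level never falls below the foreground level.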

D. Coding Results

The FB coding scheme is demonstrated on the CIF Foreman

video sequence. First, we used our proposed face-segmentation

algorithm to separate each frame of the input sequence into

foreground and background MB’s. The results for the ﬁrst

frame of the sequence are shown in Fig. 19(a) and (b).

We then encoded the sequence with both the RM8 and

H.261FB coders. Note that, other than the use of the dis-

criminatory quantization process and the new rate-control

function as described in the previous section, the rest of the

implementation of the H.261FB coder is the same as for RM8.

To evaluate the discriminatory quantization process, we

performed intraframe coding on the ﬁrst frame. To provide a


Fig. 18. (a) Rate-control function used in the RM8 coder and (b) proposed

rate-control function for the background MB’s in the H.261FB coder.

fair comparison of image quality, the quantization parameters

were manually obtained so that both approaches consume

a similar number of bits. Therefore, the quantizer for the RM8 coder was fixed at 22 throughout the entire encoding process. For the H.261FB coder, the foreground quantizer $Q_f$ and the background quantizer $Q_b$ were set at 11 and 31, respectively. Overall, the RM8 coder spent an average of

105.81 bits per MB. Furthermore, we have identiﬁed that it

spent an average of 89.01 bits per MB in the foreground

region and 109.54 bits per MB in the background region.

The quality of the encoded image is shown in Fig. 19(c).

This is compared with the H.261FB-encoded image shown in

Fig. 19(d), whereby the coder spent an average of 134.72 bits

per foreground MB and 90.70 bits per background MB, while

its overall average was 98.70 bits per MB. This overall average is about 7.11 bits per MB lower than that of RM8,

and yet the ﬁgures clearly show that the area of interest is much

improved in the H.261FB-encoded image as a result of the bit

transfer from the background to foreground region, while its

degradation in the background region was hardly noticeable.

The improvement can be further illustrated by magnifying the

face region of the images as shown in Fig. 19(e) and (f).
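The per-region averages quoted above can be cross-checked against the overall averages. The split of MB's between regions is not stated in the text; the sketch below infers it from the reported figures, assuming a CIF frame of 22 × 18 = 396 MB's. The function name is hypothetical.

```python
def foreground_fraction(avg_fg, avg_bg, avg_all):
    """Fraction f of foreground MB's solving f*avg_fg + (1-f)*avg_bg = avg_all."""
    return (avg_bg - avg_all) / (avg_bg - avg_fg)


TOTAL_MBS = 22 * 18  # CIF luminance is 352 x 288, i.e. 22 x 18 macroblocks

# Reported averages in bits per MB: (foreground, background, overall).
rm8_fraction = foreground_fraction(89.01, 109.54, 105.81)
fb_fraction = foreground_fraction(134.72, 90.70, 98.70)

rm8_fg_mbs = round(rm8_fraction * TOTAL_MBS)
fb_fg_mbs = round(fb_fraction * TOTAL_MBS)

# Recompute the overall averages from the inferred split.
bg_mbs = TOTAL_MBS - rm8_fg_mbs
rm8_overall = (rm8_fg_mbs * 89.01 + bg_mbs * 109.54) / TOTAL_MBS
fb_overall = (fb_fg_mbs * 134.72 + bg_mbs * 90.70) / TOTAL_MBS
```

Both coders give the same inferred split, about 18% of the MB's in the foreground, and the recomputed overall averages agree with the reported 105.81 and 98.70 bits per MB to within rounding.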

To demonstrate the performance of our proposed rate-

control functions for the FB coding scheme, both the RM8

and H.261FB coders were used to encode 100 frames of the

Foreman sequence at a target bit rate of 192 kbits/s and frame

rate of 10 f/s. A plot displaying the bit rates achieved by both

coders is provided in Fig. 20. The simulation revealed that the

subjective quality of the H.261FB-coded images was much

better than that of the RM8-coded images, and yet their bit rates

were slightly lower. We illustrate the improvement by showing

a representative frame 72 of the encoded images in Fig. 21.

It can be clearly observed that the H.261FB-coded image in

Fig. 21(b) has a better perceived quality and rendition of facial

features than the RM8-coded image shown in Fig. 21(a).


Fig. 19. (a) Foreground MB’s. (b) Background MB’s. (c) Image coded by RM8. (d) Image coded by H.261FB. (e) Magnified image of (c). (f) Magnified image of (d).

VI. CONCLUDING REMARKS

The color analysis approach to face segmentation was

discussed. In this approach, the face location can be identiﬁed

by performing region segmentation with the use of a skin-

color map. This is feasible because human faces have a special

color distribution characteristic that differs signiﬁcantly from

those of the background objects. We have found that pixels

belonging to the facial region, of the image in YCrCb color

space, exhibit similar chrominance values. Furthermore, a

consistent range of chrominance values was also discovered

from many different facial images, which include people of

European, Asian, and African descent. This led us to the

derivation of a skin-color map that models the facial color

of all human races.

With this universal skin-color map, we classiﬁed pixels

of the input image into skin color and non-skin color.


Fig. 20. Bit rates achieved by RM8 and H.261FB coders at a target bit rate of 192 kbits/s.


Fig. 21. Frame 72 of the coded results in Fig. 20: (a) RM8 and (b) H.261FB.

Consequently, a bitmap is produced, containing the facial re-

gion that is corrupted by noise. The noise may appear as small

holes on the facial region due to undetected facial features, or

it may also appear as objects with skin-color appearance in the

background scene. To cope with this noise and, at the same

time, reﬁne the facial-region detection, we have proposed a set

of novel region-based regularization processes that are based

on the spatial distribution study of the detected skin-color

pixels and their corresponding luminance values. All the oper-

ations are unsupervised and low in computational complexity.

Our proposed face-segmentation methodology was imple-

mented and tested on many input images, each of which

contains the head-and-shoulders view of a person and a

complex background scene. A set of representative results

from our simulations was shown in this paper. The results

demonstrated that our algorithm can accurately segment out

the facial regions from a diverse range of images that includes

subjects with different skin colors and the presence of various

background complexities. Furthermore, the face segmentation

was done automatically and in real time.

The use of face segmentation for video coding in applica-

tions such as videotelephony was then presented. We described

a foreground/background video coding scheme that uses the

face-segmentation results to improve the perceptual quality of

the encoded image with better rendition of the facial features.

This technique involves bit transfer between the facial region

and the background. The redistribution of bit allocation is

controlled by a discriminatory quantization process. Then the

bitstream generated from this process is regularized by a new

rate-control strategy. We have integrated this approach into the

H.261 framework with success. Improved image quality was

obtained as shown by the simulation results in the paper.

Our future research will involve the use of temporal infor-

mation to assist in face localization and also for tracking. For

coding, a further study of the rate-control strategy, the use

of segmentation-assisted motion estimation, and the proposal


of coding the foreground and background regions at different

frame rates will be investigated.

REFERENCES

[1] D. Chai and K. N. Ngan, “Foreground/background video coding

scheme,” in Proc. IEEE Int. Symp. Circuits Syst., Hong Kong, June

1997, vol. II, pp. 1448–1451.

[2] A. Eleftheriadis and A. Jacquin, “Model-assisted coding of video

teleconferencing sequences at low bit rates,” in Proc. IEEE Int. Symp.

Circuits Syst., London, U.K., June 1994, vol. 3, pp. 177–180.

[3] K. Aizawa and T. Huang, “Model-based image coding: Advanced video

coding techniques for very low-rate applications,” Proc. IEEE, vol. 83,

pp. 259–271, Feb. 1995.

[4] V. Govindaraju, D. B. Sher, R. K. Srihari, and S. N. Srihari, “Locating

human faces in newspaper photographs,” in Proc. IEEE Computer Vision

Pattern Recognition Conf., San Diego, CA, June 1989, pp. 549–554.

[5] G. Sexton, “Automatic face detection for videoconferencing,” in Proc.

Inst. Elect. Eng. Colloquium Low Bit Rate Image Coding, May 1990,

pp. 9/1–9/3.

[6] V. Govindaraju, S. N. Srihari, and D. B. Sher, “A computational model

for face location,” in Proc. Int. Conf. Computer Vision, Dec. 1990, pp.

718–721.

[7] H. Li, “Segmentation of the facial area for videophone applications,”

Electron. Lett., vol. 28, pp. 1915–1916, Sept. 1992.

[8] S. Shimada, “Extraction of scenes containing a speciﬁc person from

image sequences of a real-world scene,” in Proc. IEEE TENCON’92,

Melbourne, Australia, Nov. 1992, pp. 568–572.

[9] M. Menezes de Sequeira and F. Pereira, “Knowledge-based videotele-

phone sequence segmentation,” in Proc. SPIE Visual Commun. and

Image Processing, vol. 2094, Nov. 1993, pp. 858–869.

[10] G. Yang and T. S. Huang, “Human face detection in a complex

background,” Pattern Recognit., vol. 27, no. 1, pp. 53–63, Jan. 1994.

[11] A. Eleftheriadis and A. Jacquin, “Automatic face location detection and

tracking for model-assisted coding of video teleconferencing sequences

at low-rates,” Signal Process. Image Commun., vol. 7, nos. 4–6, pp.

231–248, Nov. 1995.

[12] J. Luo, C. W. Chen, and K. J. Parker, “Face location in wavelet-based

video compression for high perceptual quality videoconferencing,” IEEE

Trans. Circuits Syst. Video Technol., vol. 6, pp. 411–414, Aug. 1996.

[13] T. F. Cootes and C. J. Taylor, “Locating faces using statistical feature

detectors,” in Proc. Int. Conf. Automatic Face and Gesture Recognition,

Killington, VT, Oct. 1996, pp. 204–209.

[14] H. Li and R. Forchheimer, “Location of face using color cues,” in Proc.

Picture Coding Symp., Lausanne, Switzerland, Mar. 1993, paper 2.4.

[15] M. Hunke and A. Waibel, “Face locating and tracking for human-

computer interaction,” in Proc. Conf. Signals, Syst. and Computers, Nov.

1994, vol. 2, pp. 1277–1281.

[16] S. Matsuhashi, O. Nakamura, and T. Minami, “Human-face extraction

using modiﬁed HSV color system and personal identiﬁcation through

facial image based on isodensity maps,” in Proc. Conf. Electrical

and Computer Engineering, Montreal, P.Q., Canada, 1995, vol. 2, pp.

909–912.

[17] Q. Chen, H. Wu, and M. Yachida, “Face detection by fuzzy pattern

matching,” in Proc. Int. Conf. Computer Vision, Cambridge, MA, June

1996, pp. 591–596.

[18] K. Sobottka and I. Pitas, “Face localization and facial feature extraction

based on shape and color information,” in Proc. IEEE Int. Conf. Image

Processing, Sept. 1996, vol. III, pp. 483–486.

[19] D. Saxe and R. Foulds, “Toward robust skin identiﬁcation in video

images,” in Proc. Int. Conf. Automatic Face and Gesture Recognition,

Killington, VT, Oct. 1996, pp. 379–384.

[20] R. Kjeldsen and J. Kender, “Finding skin in color images,” in Proc. Int.

Conf. Automatic Face and Gesture Recognition, Killington, VT, Oct. 1996,

pp. 312–317.

[21] D. Chai and K. N. Ngan, “Automatic face location for videophone

images,” in Proc. IEEE TENCON’96, Perth, Australia, Nov. 1996, vol.

1, pp. 137–140.

[22] T. Cornall and K. Pang, “The use of facial color in image segmenta-

tion,” in Proc. Australia Telecommun. Networks and Applications Conf.,

Melbourne, Australia, Dec. 1996, pp. 351–356.

[23] Y. J. Zhang, Y. R. Yao, and Y. He, “Automatic face segmentation using

color cues for coding typical videophone scenes,” in Proc. SPIE Visual

Commun. and Image Processing, San Jose, CA, Feb. 1997, vol. 3024,

pp. 468–479.

[24] M. J. T. Reinders, P. J. L. van Beek, B. Sankur, and J. C. A. van

der Lubbe, “Facial feature localization and adaptation of a generic face

model for model-based coding,” Signal Process. Image Commun., vol.

7, no. 1, pp. 57–74, Mar. 1995.

[25] D. Chai and K. N. Ngan, “Extraction of VOP from videophone scene,”

in Proc. VLBV’97 Conf., Linköping, Sweden, July 1997, pp. 45–48.

[26] H. P. Graf, E. Cosatto, D. Gibbon, M. Kocheisen, and E. Petajan,

“Multi-modal system for locating heads and faces,” in Proc. Int. Conf.

Automatic Face and Gesture Recognition, Killington, VT, Oct. 1996,

pp. 88–93.

[27] “Information technology—Coding of moving pictures and associated

audio—For digital storage media up to about 1.5 Mbits/s—CD 11172,”

ISO/IEC MPEG, Dec. 1991.

[28] “Information technology—General coding of moving pictures and asso-

ciated audio information: Video,” Draft Int. Standard, ISO/IEC 13818-2,

ITU-T Rec. H.262, Nov. 1994.

[29] “Video coder for audiovisual services at 64 kbit/s,” ITU-T Rec.

H.261, Mar. 1993.

[30] “Video coding for low bitrate communication,” ITU-T Rec. H.263, May

1996.

[31] CCITT Study Group XV, “Document 525, description of reference

model (RM8),” June 9, 1989.

Douglas Chai (S’91) was born in Kuching, Malaysia, in 1973. He received

the ﬁrst class honors degree in electrical and electronic engineering from the

University of Western Australia, Australia, in 1994, where he currently is

pursuing the Ph.D. degree with the visual communications research group.

His research interests are in image compression, video coding, image

segmentation, and facial image analysis.

Mr. Chai received the Australian Postgraduate Award and the Telstra

Research Laboratories Postgraduate Fellowship Award.

King N. Ngan (M’79–SM’91), for a photograph and biography, see p. 3 of

the February 1999 issue of this TRANSACTIONS.

A color attention mechanism based on YES color space for skin segmentation

Article

Full-text available

May 2023

Skin segmentation plays an important role in image processing and human–computer interaction tasks. However, it is a challenging task to accurately detect skin regions from various scenes with different illumination or color styles. In addition, in the field of video processing, reducing the computational load and improving the real-time performance of the algorithm has also become an important topic of skin segmentation. Existing deep semantic segmentation networks usually pay too much attention to the detection performance of the model and make the model structure tend to be complex, which brings heavy computational burden. To achieve the trade-off between detection performance and real-time performance of the skin segmentation algorithm, this paper proposes a lightweight skin segmentation network. Compared with existing semantic segmentation networks, this model adopts a simpler structure to improve the real-time performance. In addition, to improve the feature fitting ability of the network without slowing down its inference speed, this paper proposes a color attention mechanism, which locates skin regions in images based on the distribution features of skin colors on the E-R/G color plane generated from the YES color space, and guides the network to update parameters. Experimental results show that this method not only exhibits similar detection performance to existing semantic segmentation networks such as U-Net and DeepLab, but also the computation load of the model is 18.1% lower than Fast-SCNN.

Exploring the Feasibility of Vision-Based Non-Contact Oxygen Saturation Estimation: Considering Critical Color Components and Individual Differences

Article

Full-text available

May 2024

The blood oxygen saturation, which indicates the ratio of oxygenated hemoglobin to total hemoglobin in the blood, is closely related to one’s health status. Oxygen saturation is typically measured using a pulse oximeter. However, this method can cause skin irritation, and in situations where there is a risk of infectious diseases, the use of such contact-based oxygen saturation measurement devices can increase the risk of infection. Therefore, recently, methods for estimating oxygen saturation using facial or hand images have been proposed. In this paper, we propose a method for estimating oxygen saturation from facial images based on a convolutional neural network (CNN). Particularly, instead of arbitrarily calculating the AC and DC components, which are essential for measuring oxygen saturation, we directly utilized signals obtained from facial images to train the model and predict oxygen saturation. Moreover, to account for the time-consuming nature of accurately measuring oxygen saturation, we diversified the model inputs. As a result, for inputs of 10 s, the Pearson correlation coefficient was calculated as 0.570, the mean absolute error was 1.755%, the root mean square error was 2.284%, and the intraclass correlation coefficient was 0.574. For inputs of 20 s, these metrics were calculated as 0.630, 1.720%, 2.219%, and 0.681, respectively. For inputs of 30 s, they were calculated as 0.663, 2.142%, 2.612%, and 0.646, respectively. This confirms that it is possible to estimate oxygen saturation without calculating the AC and DC components, which heavily influence the prediction results. Furthermore, we analyzed how the trained model predicted oxygen saturation through ‘SHapley Additive exPlanations’ and found significant variations in the feature contributions among participants. This indicates that, for more accurate predictions of oxygen saturation, it may be necessary to individually select appropriate color channels for each participant.

Semantic hand gesture integration system using self-co-articulation and movement epenthesis detection

Article

Full-text available

May 2024
VISUAL COMPUT

Recognizing hand gestures poses a formidable challenge, particularly when dealing with semantic gestures that require disentanglement prior to recognition. This paper addresses the intricate issue of an additional stroke, commonly referred to as ‘movement epenthesis stroke,’ which emerges between continuous gestures. Our proposed system employs a multifaceted approach to tackle this challenge. Initially, the system extracts color-motion information to facilitate hand detection, subsequently employing a fusion of shape information and a modified Kanade–Lucas–Tomasi (KLT) feature tracker. This integration significantly mitigates the issue of occlusions. The identification of movement epenthesis is accomplished by analyzing the gesture trajectory using a speed profile. Furthermore, self-co-articulation strokes are discerned by leveraging slope-angle information. To enhance the recognition process, a carefully selected set of 40 features is extracted, which are then employed for recognizing the resulting meaningful gestures. These features serve as inputs to various classification models, including support vector machines (SVM), k-nearest neighbors (kNN), and extreme learning machines (ELM). Deep learning algorithms are judiciously deployed to recognize gesture trajectories, thus streamlining the time-consuming feature extraction process. The outcomes of individual classifiers are amalgamated, resulting in a classifier fusion model. This model is enhanced through majority voting and is used in conjunction with cross-validation results. The experimental analysis culminates in an impressive accuracy rate of 98.88% achieved by the classifier fusion model. This achievement surpasses the performance of individual classifiers, underscoring the effectiveness of our proposed methodology.

Virtual Trial Room with Computer Vision and Machine Learning

Conference Paper

Full-text available

May 2023

Online shopping has revolutionized the retail industry, providing customers with convenience and accessibility. However, customers often hesitate to purchase wearable products such as watches, jewelry, glasses, shoes, and clothes due to the lack of certainty regarding fit and suitability. This leads to significant return rates, causing problems for both customers and vendors. To address this issue, A platform called the Virtual Trial Room with Computer Vision and Machine Learning is designed which enables customers to easily check whether a product will fit and suit them or not. To achieve this, an AI-generated 3D model of the human head was created from a single 2D image using DECA model. This 3D model was then superimposed with a custom-made 3D model of glass which is based on real-world measurements and fitted over the human head. To replicate the real-world look and feel, the model was retouched with textures, lightness, and smoothness. Furthermore, A full-stack application was developed utilizing various technologies such as HTML, CSS, JavaScript, React, Babylon.js, etc. This application enables users to view 3D-generated results on website, providing an immersive and interactive experience. In summary, Virtual Trial Room with Computer Vision and Machine Learning platform provides a sophisticated solution to the problem of online shopping for eye glasses. By utilizing advanced technology, main aim of this project is to significantly reduce return rates and enhance the overall customer experience.

A Novel Rule-Based Skin Detection Method using Principal Component Analysis-Based Dimensionality Reduction and Individual Contribution on Principal Components

Conference Paper

Dec 2023

Skin detection plays a vital role in various human-related computer vision applications, including human-computer interaction, medical diagnostic tools, and web content filtering. However, accurate skin detection remains challenging due to factors such as luminosity variations, complex backgrounds, and diversity in skin tones. In this paper we present a rule-based skin detection method that applies dimensionality reduction using Principal Component Analysis (PCA) on pixels represented by multiple color channels. This process retains only the most pertinent information in the form of principal components. Subsequently, skin detection is achieved according to the individual contribution of the pixels along these principal components. To evaluate the effectiveness of our approach, we conducted comprehensive experiments on the SFA dataset. Our method demonstrated consistently superior skin detection performance compared to other rule-based methods, in both quantitative and qualitative aspects across diverse scenarios.

Keyboard and Mouse Free Music Controller

Article

Feb 2024

Human-machine interface (HMI) is a crucial area of research as gestures have the potential to efficiently control and interact with computers. Many applications for hand detection have been created as a result of the pervasive use of built-in cameras in computers, smartphones, and tablets. For the majority of users, however, many of these are not useful. A straightforward concept for a keyboard- and mouse-free music controller is presented in this research. Using MATLAB code that integrates skin detection, area labelling, erosion, dilation, and motion differentiation, a music player controller is developed using the real-time frame tracking feature of a camera. Three hand detection algorithms are created and assessed for maximum performance and accuracy. Real-time hand detection for operating the music player is provided by the algorithm, which is created with efficiency and speed in mind.

Contactless HR Measurement from Facial Videos Using Alternative Color Spaces with CEEMDAN

Conference Paper

Dec 2023

Human skin detection: An unsupervised machine learning way

Article

Feb 2024
J VIS COMMUN IMAGE R

A Review of Vision-Based Hand Action Recognition Techniques

Conference Paper

Jul 2023

Visible Joint Classification and Temporal Segment Matching based 3D Pose Refinement for Volleyball Receive Analysis

Article

Jun 2023

Receive action pose analysis is very meaningful in volleyball games for training and strategy. Because the receive action lacks high-quality labeled data sets and suffers from problems such as occlusion, body overlap, and abnormal poses, conventional work fails to obtain highly accurate pose results. This paper proposes visible joint refinement and receive action template matching for volleyball receive action analysis. First, pixel color features and the potential constraint features between space and joints are used to classify the visible joints and refine erroneous ones. Second, template matching is used to refine the pose segments at the 3D level. The method is based on a multi-view system for a real volleyball competition scene. The dataset video is from the Game of 2014 Japan Inter High School of Men Volleyball. The experimental results achieve 95.33%, 96.92%, and 98.43% success rates at the 30 mm, 50 mm, and 70 mm error ranges, respectively.

Automatic face location detection and tracking for model-assisted coding of video teleconference sequences at low bit rates

Article

Sep 1995
SIGNAL PROCESS-IMAGE

We present a novel and practical way to integrate techniques from computer vision to low bit-rate coding systems for video teleconferencing applications. Our focus is to locate and track the faces of persons in typical head-and-shoulders video sequences, and to exploit the face location information in a ‘classical’ video coding/decoding system. The motivation is to enable the system to selectively encode various image areas and to produce psychologically pleasing coded images where faces are sharper. We refer to this approach as model-assisted coding. We propose a totally automatic, low-complexity algorithm, which robustly performs face detection and tracking. A priori assumptions regarding sequence content are minimal and the algorithm operates accurately even in cases of partial occlusion by moving objects. Face location information is exploited by a low bit-rate 3D subband-based video coder which uses both a novel model-assisted pixel-based motion compensation scheme, as well as model-assisted dynamic bit allocation with object-selective quantization. By transferring a small fraction of the total available bit-rate from the non-facial to the facial area, the coder produces images with better-rendered facial features. The improvement was found to be perceptually significant on video sequences coded at 96 kbps for an input luminance signal in CIF format. The technique is applicable to any video coding scheme that allows for fine-grain quantizer selection (e.g. MPEG, H.261), and can maintain full decoder compatibility.

Face location in wavelet-based video compression for high perceptual quality videoconferencing

Conference Paper

Jan 1995

We present a human face location technique based on contour extraction within the framework of a wavelet-based video compression scheme for videoconferencing applications. In addition to an adaptive quantization in which spatial constraints are enforced to preserve perceptually important information at low bit rates, semantic information of the human face is incorporated to design a hybrid compression scheme for videoconferencing, since the face is often the most important part and should be coded with high fidelity. The human face is detected based on contour extraction and feature point analysis. An approximate face mask is then used in the quantization of the decomposed subbands. At the same total bit rate, coarser quantization of the background enables the face region to be quantized more finely and coded with higher quality. Moreover, the resulting larger quantization noise in the background can be suppressed using an edge-preserving enhancement algorithm. Experimental results have shown that the perceptual image quality is greatly improved using the proposed scheme.

Extraction of scenes containing a specific person from image sequences of a real-world scene

Article

Jan 1992

S. Shimada

A method of extracting the head area from image sequences of a real-world scene for the purpose of facial discrimination processing is proposed. The method is based on tracking of the head area by characteristic point matching and edge matching. Experiments show that the method can extract the head area from scenes of a person walking freely against a general background in variable lighting, a person coming into contact with other moving objects, and a person whose head shape changes.

Automatic face segmentation using color cues for coding typical videophone scenes

Article

Jan 1997
Proceedings of SPIE

This paper presents a simple color segmentation technique which could be used in model-based very low bit-rate coding approaches for videophone applications, in which delimitation of the speaker's face is required. This work attempts to segment the face of the speaker using color cues. To take better advantage of the color content of images, the color segmentation is carried out in HSI (hue, saturation, intensity) space, with the three components used in two steps. The original image is first split into two groups of regions, one with higher saturation values and the other with lower saturation values, by applying an adaptive threshold to the saturation histogram. In the high-saturation regions, the hue component can furnish useful references for further segmentation, while in the low-saturation regions the intensity component can play a similar role. For each group of regions, a multi-thresholding technique based on either the hue or the intensity component is then proposed for the subsequent segmentation. After both groups of regions are segmented, a combination of the two segmentation results provides the final segmented image. Some experiments with images taken from typical 'head-and-shoulders' videophone sequences are carried out and some results are presented.

Electronic Imaging '97

Article

Jan 1997

Visual Communications '93

Article

Oct 1993

This paper presents a robust knowledge-based segmentation algorithm for videotelephony sequences ranging from studio based to mobile. It is able to divide each image in a sequence into non-overlapping head, body, and background areas. Its robustness stems from its ability to cope with the peculiarities of mobile sequences, which have very detailed, moving backgrounds as well as strong camera movements (originating from vibration in car videotelephones or from small hand movements in hand-held videotelephones). The proposed algorithm uses the detection of edges and changed areas (due to the speaker's motion), as well as the redundancy associated with the speaker's position, as the basis for the segmentation. Geometrical knowledge-based techniques are then used to define the complete regions. The algorithm includes a quality estimation and control procedure, which enables it to decide whether to accept or reject the current segmentation, and which can be input to the videotelephone coder.

Foreground/background video coding scheme

Conference Paper

Jul 1997

This paper presents the use of segmentation to improve the subjective quality of the sequence produced by a very low bit rate video coding system. Through this approach, each frame of the source sequence is first segmented into two non-overlapping regions, namely foreground and background. These two regions are then encoded using the same coder but with different quantization step-sizes. In this way, the image quality of the foreground region can be improved at the expense of encoding the unimportant background region at lower quality. Currently, our work focuses on integrating this approach into the H.263 coder, primarily for the videotelephony application. In this paper, we describe the working implementation, and also demonstrate the improved subjective quality of the coded sequence achieved.

Human face detection in complex background

Article

Jan 1994
PATTERN RECOGN

The human face is a complex pattern. Finding human faces automatically in a scene is a difficult yet significant problem. It is the first important step in a fully automatic human face recognition system. In this paper a new method to locate human faces in a complex background is proposed. This system utilizes a hierarchical knowledge-based method and consists of three levels. The higher two levels are based on mosaic images at different resolutions. In the lower level, an improved edge detection method is proposed. In this research the problem of scale is dealt with, so that the system can locate unknown human faces spanning a wide range of sizes in a complex black-and-white picture. Some experimental results are given.

Facial feature localization and adaptation of a generic face model for model-based coding

Article

Nov 1995
SIGNAL PROCESS-IMAGE

A method for the adaptation of a generic 3-D face model to an actual face in a head-and-shoulders scene is discussed, with application to video-telephony. The adaptation is carried out both on a global scale to reposition and resize the wire-frame, as well as on a local scale to mimic individual physiognomy. To this effect a hierarchical scheme is developed to extract the semantic features in the head-and-shoulders scene, such as silhouette, face, eyes and mouth, using a knowledge-based selection mechanism. These algorithms, which are to be an integral part of a general model-based image coder, are tested on typical videophone sequences.

A computational model for face location

Conference Paper

Jan 1990

The authors adopted a model-based approach, where the shape of the object is defined in terms of several mini-templates. The mini-templates are abstract descriptions of simple geometric features like arcs and corners. Relationships between mini-templates are not rigid. Rather, they are represented by springs that allow deformation of a template in terms of its size and orientation. Cost functionals are determined empirically. The authors expect their system to generate candidate regions in a given photograph, each associated with a rank of its goodness.

Recommended publications

Automatic face location for videophone images

Locating facial region of a head-and-shoulders color image

Automatic face segmentation in YCrCb images

A novel face segmentation algorithm