
Time-Series Physiological Data Balancing for Regression

Hiroki Yoshikawa, Akira Uchiyama, Teruo Higashino
Graduate School of Information Science and Technology, Osaka University
{h-yoshikawa, uchiyama, higashino}@ist.osaka-u.ac.jp
Abstract—Many studies have shown the effectiveness of machine learning in estimating psychological or physiological states using physiological data as input. However, it is ethically and physically difficult to collect a large amount of unbiased data in an uncontrolled environment. Specifically, the amount of data for rare cases is especially small compared to that for common cases. Therefore, the distribution bias may cause overfitting in machine learning. In this paper, we propose a SMOTE-based method that alleviates the distribution bias by data augmentation for regression problems on datasets containing time-series physiological data. The effectiveness of the proposed method was confirmed on datasets of thermal sensation and core body temperature collected in uncontrolled environments. The results show that our method improves the performance of regression models for minor cases with only a slight increase in the mean absolute error.
Keywords—Machine learning, Data preprocessing, Health care,
Time-series, Regression
I. INTRODUCTION
Machine learning is one of the most commonly used approaches for a wide variety of applications, including healthcare. Many healthcare applications use time-series data such as heart rate and body temperature [1], [2] to estimate psychological or physiological states of humans using machine learning algorithms. For training estimators, data collection through real experiments is essential. The challenge in data collection, especially in healthcare applications, is data imbalance, i.e., a non-uniform distribution of the dataset. This is natural because minor cases do not happen frequently
in the real world. The imbalanced data causes classifiers to be biased towards the majority class, leading to performance degradation for the important minority samples of interest [3], [4]. This is because standard machine learning methods usually seek to minimize the training error. For some studies, it may be possible to collect data for minor cases by carefully designing experiment protocols. Nevertheless, such data collection is limited to controlled environments. Therefore, it is difficult to collect data for minor cases in uncontrolled environments in the healthcare domain.
To address this problem, researchers use a pre-processing step in the modeling workflow called data balancing [5]. Data balancing pre-processes the data to increase its size and diversity and to improve the robustness of models. Oversampling and undersampling balance the dataset by augmenting minor data and discarding major data, respectively. The
former approach is applied based on existing algorithms such
as Synthetic Minority Oversampling Technique (SMOTE) [4].
SMOTE augments the data in minor classes by interpolation
for classification problems.
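The interpolation at the core of SMOTE can be written in a few lines. The following NumPy sketch is our own illustration, not the authors' code, and omits the k-nearest-neighbor selection of the full algorithm; it generates one synthetic sample between two minority-class samples:

```python
import numpy as np

def smote_interpolate(x1, x2, rng=np.random.default_rng(0)):
    """One SMOTE step: draw a synthetic sample on the segment
    between two minority-class samples x1 and x2."""
    alpha = rng.uniform(0.0, 1.0)      # random interpolation ratio in [0, 1]
    return x1 + alpha * (x2 - x1)

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 6.0])
synthetic = smote_interpolate(x1, x2)  # lies between x1 and x2 elementwise
```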
In the field of physiological sensing, estimations of human states such as thermal sensation using machine learning often regard the target problems as classification problems for simplicity [6], [7]. However, to achieve estimations of human states with finer granularity, regression is more appropriate than classification because regression estimates continuous numerical values, and data balancing algorithms designed for classification problems are therefore not suitable for regression problems. For data
balancing in regression problems, some algorithms such as
SMOTER [8] were proposed based on SMOTE. SMOTER
divides the distribution of the numerical target values into the
major and minor values based on the relevance score of the
target value. After that, it applies undersampling for major
cases and oversampling for minor cases. The oversampling
strategy is designed for numerical values based on SMOTE. It
uses the weighted average between the target values of the two
minor cases based on the distance between them. Through the
above steps, the dataset for a regression problem is balanced.
However, SMOTER does not consider time-series feature
values commonly used in psychological or physiological state
estimation.
In this paper, we propose a data balancing method for
regression with time-series feature values based on SMOTER.
To consider the temporal dependency of time-series data, we
extend a distance function. To define the distance between
time-series samples, we use Dynamic Time Warping (DTW)
distance as used in TS SMOTE [9]. TS SMOTE is designed
to extend SMOTE to time-series data. Our method interpolates
synthetic time-series using the weighted average and the
DTW distance. Table I summarizes the difference between
our method and other balancing methods in terms of the
capability to deal with time-series features and target problem
types. As far as we know, our method is the first to achieve
data balancing for regression problems with time-series feature
values.
For evaluation, we apply our method to two imbalanced
datasets with time-series feature values. The first dataset is
the thermal sensation dataset, which consists of time-series
physiological data measured by a wristband sensor as a feature
value and thermal sensation vote (TSV) as a target value.
TABLE I
SUMMARY OF BALANCING METHODS.

Method          | time-series | target problem
SMOTE           | No          | classification
TS SMOTE        | Yes         | classification
SMOTER          | No          | regression
Proposed method | Yes         | regression
The second dataset is the core body temperature dataset. The
core body temperature is measured by a tympanic temperature
sensor during exercise. Feature values are measured by a
wristband sensor, chest strap sensor, and environmental sensor.
The results show that our method improves the performance of regression models for minor cases with only a slight increase in the mean absolute error.
II. RELATED WORK
A. SMOTE-based Data Balancing
In order to deal with imbalanced datasets, researchers have proposed data augmentation methods. The basic strategy is pre-processing the data to increase its size and diversity [10]. SMOTE is a predominant data augmentation technique for classification. Chawla et al. [4] showed the advantages of this approach compared to alternative sampling techniques on several real-world problems using several classification algorithms. Because of this advantage, methods derived from SMOTE have been proposed [11], such as the Adaptive Synthetic sampling approach (ADASYN) [12] and Borderline-SMOTE [13]. These SMOTE-based extensions replace the original interpolation procedure with more complex ones, such as clustering and probabilistic functions.
Furthermore, filtering extensions applied after SMOTE have been proposed, such as SMOTE + Tomek [14] and SMOTE + ENN [15]. To clarify the boundaries between classes, they remove unnecessary samples from the dataset after the data augmentation. These are kinds of undersampling techniques.
In this paper, to focus on evaluating the data augmentation of
time-series data, the combination with undersampling is out
of scope. However, those undersampling techniques can be
applied after the data augmentation by the proposed method.
For regression problems, SMOTER [8] and SMOGN [16],
which is an extension of SMOTER with Gaussian noise, are
proposed. These methods extend SMOTE for regression prob-
lems by using the relevance function representing the density
of the training data. In this paper, we build on SMOTER, which is one of the SMOTE-based extensions for regression problems. It separates the minor cases from the distribution by a user-defined threshold. After the separation, it generates new cases based on weighted averages between pairs of minor cases.
B. Data Balancing for Time-series Classification
For time-series classification, some balancing methods have been proposed that generate synthetic samples. TS SMOTE [9] is an extension of SMOTE designed for time-series data. It introduced DTW into the time-series merging algorithm used in the augmentation. On the other hand, OHIT [17] was proposed as an oversampling method for imbalanced time-series classification. OHIT differs from other state-of-the-art oversampling algorithms because it generates structure-preserving synthetic samples. These methods generate samples using features of time-series categorized into the same class. In contrast, we propose a data balancing method for time-series regression, which cannot rely on class labels for the time-series samples.
C. Learning-based Data Balancing
Another approach to data augmentation is generative models [18] based on deep learning, such as Generative Adversarial Networks (GANs) [19] and Variational Autoencoders (VAEs) [20]. These learning-based approaches can generate synthetic time-series data and augment a training dataset effectively [21]. The algorithms model the real data distribution P_r by learning a distribution P_θ parameterized by θ. The data are generated by learning a function g_θ that transforms Gaussian noise Z such that P_θ ≈ g_θ(Z). These approaches generate realistic values in several domains, such as computer vision and cybersecurity [22]. However, generative models need training, which means they may overfit to common values when trained on an imbalanced dataset. Dealing with this problem requires either a specialized loss function, which makes the model more complicated, or data balancing before training. Therefore, we propose a data augmentation method based on an algorithmic approach that does not need training.
III. PROPOSED METHOD
A. Problem Definition
Imbalanced regression is a sub-class of regression problems [23]. In this setting, given a training set D = {⟨x_i, y_i⟩}_{i=1}^{N}, the goal is to obtain a model m(x) that approximates an unknown regression function Y = f(x) as defined in Ref. [23]. In the problem we address, x is a feature vector that contains time-series data from N_f sensors as follows:

x_i = {f_1, . . . , f_{N_f}},   (1)
f_j = {s_1, . . . , s_{N_s}},   (2)

where N_f is the number of time-series features, and N_s is the length of each time-series f_j. In this paper, N_s is the same constant for all f_j. This is because basic learning approaches for time-series, such as Recurrent Neural Networks (RNNs), assume input time-series of equal length.
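Under this formulation, a dataset can be held as an N × N_f × N_s array together with N continuous targets. The NumPy sketch below uses hypothetical sizes for illustration only, not the sizes of the paper's datasets:

```python
import numpy as np

# Hypothetical sizes for illustration only
N, N_f, N_s = 100, 3, 10             # samples, time-series features, series length

rng = np.random.default_rng(42)
X = rng.normal(size=(N, N_f, N_s))   # each x_i holds N_f series of equal length N_s
y = rng.normal(size=N)               # continuous regression targets
```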
B. Data Balancing
The overview of the proposed method is shown in Fig. 1. The input is an imbalanced original dataset with time-series features, and the output is a balanced dataset. First, a relevance function φ(y) is generated based on the distribution of the target value y in the original dataset, as proposed in SMOTER. The relevance function is automatically generated based on a probability density function (pdf) [24]. Second, the original distribution is separated into minor data D_r and major data D_c by a user-defined threshold t_E as follows:

D_r = {⟨x, y⟩ ∈ D | φ(y) > t_E},   (3)
D_c = D \ D_r.   (4)

After this separation, oversampling and undersampling are carried out on D_r and D_c, respectively.

Fig. 1. Overview of the proposed method: relevance function generation, separation into common and rare cases by a threshold, time-series data generation (oversampling) for rare cases, and undersampling for common cases.

Fig. 2. Key idea of time-series data generation for regression: a synthetic time-series x and target value y are interpolated between ⟨x_1, y_1⟩ and ⟨x_2, y_2⟩ with ratios α and 1 − α.
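One rough way to realize this separation is to derive φ(y) from an estimate of the target's density. The sketch below is our own histogram-based stand-in for the pdf-based relevance function of [24]; the toy data and threshold are hypothetical:

```python
import numpy as np

def relevance(y, bins=10):
    """Rough relevance phi(y) in [0, 1): rarer target values map closer
    to 1. A histogram stand-in for the pdf-based function of [24]."""
    hist, edges = np.histogram(y, bins=bins, density=True)
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, bins - 1)
    dens = hist[idx]                 # estimated density at each sample's bin
    return 1.0 - dens / dens.max()

# Toy imbalanced target: a common bulk plus a rare high tail
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 900), rng.uniform(4.0, 6.0, 100)])
phi = relevance(y)
t_E = 0.5                                  # user-defined threshold
rare_idx = np.flatnonzero(phi > t_E)       # D_r: minor cases
common_idx = np.flatnonzero(phi <= t_E)    # D_c = D \ D_r
```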
Based on D_r, we apply time-series data generation. Figure 2 illustrates our key idea based on SMOTER. A synthetic sample is interpolated between two samples in D_r with a ratio α, i.e., as their weighted average. The data generation in SMOTER needs a distance defined over pairs of x. Our main contribution is combining the DTW distance with SMOTER. This distance can be used for interpolative time-series generation that conserves features of the original time-series, such as their shape [9]. As shown in Fig. 2, synthetic points in the generated time-series are interpolated between pairs of points, which are given by DTW, using the ratio α.
The pseudocode of the main algorithm for data balancing is shown in Algorithm 1. First, based on the SMOTER algorithm, the threshold t_E of φ(y) is used to separate the samples in the dataset into the major and minor classes. φ(y) is a function that reflects the rarity of the target value y: a larger φ(y) corresponds to a smaller number of samples. The data generation for the minor classes is performed separately for the samples smaller and larger than the median ỹ of y.
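The flow of Algorithm 1 can be sketched in Python. This is a simplified stand-in with our own names and toy relevance values, where GENSYNTHCASES is replaced by plain SMOTER-style interpolation on flat feature vectors; the DTW-based time-series generation belongs in Algorithm 2:

```python
import numpy as np

def gen_synth_cases(X, y, pct_over, k, rng):
    """Simplified stand-in for GENSYNTHCASES: SMOTER-style interpolation
    on flat feature vectors (Algorithm 2 replaces this with DTW-based
    time-series interpolation)."""
    n_new = int(pct_over / 100)              # synthetic cases per original case
    Xn, yn = [], []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nns = np.argsort(d)[1:k + 1]         # k nearest neighbors (skip self)
        for _ in range(n_new):
            j = rng.choice(nns)
            a = rng.uniform()                # interpolation ratio alpha
            Xn.append(X[i] + a * (X[j] - X[i]))
            yn.append(min(y[i], y[j]) + a * abs(y[i] - y[j]))
    return np.array(Xn), np.array(yn)

def smoter(X, y, phi, t_E, pct_over, pct_under, k, rng=np.random.default_rng(0)):
    """Main balancing loop of Algorithm 1: oversample the two rare tails,
    undersample the common cases."""
    med = np.median(y)
    rare_lo = (phi > t_E) & (y < med)
    rare_hi = (phi > t_E) & (y > med)
    parts = [gen_synth_cases(X[m], y[m], pct_over, k, rng)
             for m in (rare_lo, rare_hi) if m.sum() > 1]
    Xn = np.concatenate([p[0] for p in parts])
    yn = np.concatenate([p[1] for p in parts])
    n_norm = int(pct_under / 100 * len(Xn))  # %u of the new cases
    common = np.flatnonzero(~(rare_lo | rare_hi))
    keep = rng.choice(common, size=min(n_norm, common.size), replace=False)
    return np.concatenate([Xn, X[keep]]), np.concatenate([yn, y[keep]])

# Toy usage: extremes of y are rare (hypothetical relevance values)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = np.linspace(0.0, 1.0, 50)
phi = ((y < 0.2) | (y > 0.8)).astype(float)
Xb, yb = smoter(X, y, phi, t_E=0.5, pct_over=200, pct_under=100, k=3)
```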
The method for generating time-series data is shown in Algorithm 2. The k-nearest neighbors of a minor-class sample case are extracted based on the distance calculated by Dynamic Time Warping [25]. A new time-series case new is generated, as in TS SMOTE, from a neighbor n randomly selected from the k-nearest neighbors. After the target number of samples ng has been generated by repeating the above steps, the newly generated dataset newCases is returned.

Algorithm 1 Main SMOTER algorithm.
Input: D - a data set
       t_E - threshold
       %o, %u - percentages of over- and under-sampling
       k - number of neighbors used in case generation
Output: D_new - a generated data set
 1: rareL ← {⟨x, y⟩ ∈ D : φ(y) > t_E ∧ y < ỹ}
 2: newCasesL ← GENSYNTHCASES(rareL, %o, k)
 3: rareH ← {⟨x, y⟩ ∈ D : φ(y) > t_E ∧ y > ỹ}
 4: newCasesH ← GENSYNTHCASES(rareH, %o, k)
 5: newCases ← newCasesL ∪ newCasesH
 6: nrNorm ← %u of |newCases|
 7: normCases ← sample of nrNorm cases ∈ D \ {rareL ∪ rareH}
 8: return newCases ∪ normCases

Algorithm 2 Generating synthetic cases.
Input: D - a dataset
       %o - percentage of oversampling
       k - number of neighbors used in case generation
Output: D_gen - generated new cases
 1: newCases ← {}
 2: ng ← %o / 100
 3: for all case ∈ D do
 4:   nns ← KNN(k, case, D \ {case})
 5:   for i ← 1 to ng do
 6:     n ← randomly choose one of the nns
 7:     α ← randomly choose in [0, 1]
 8:     new[y] ← min(case[y], n[y]) + α · |case[y] − n[y]|
 9:     for all f ∈ features do
10:       pairs ← DTW(case[f], n[f])
11:       for t ← 1 to |pairs| do
12:         (t_new, v_new) ← TIMEPOINT(pairs[t].case, pairs[t].n, case[f], n[f], α)
13:         new[f, t_new] ← v_new
14:       end for
15:     end for
16:     newCases ← newCases ∪ {new}
17:   end for
18: end for
19: return newCases

20: function TIMEPOINT(t_a, t_b, ts_a, ts_b, α)
21:   t_new ← min(t_a, t_b) + α · |t_a − t_b|
22:   v_new ← min(ts_a[t_a], ts_b[t_b]) + α · |ts_a[t_a] − ts_b[t_b]|
23:   return t_new, v_new
24: end function
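To make the DTW-based generation of Algorithm 2 concrete, the sketch below is our own minimal implementation, not the authors' code; the function names dtw_path and synth_series are ours, and the TIMEPOINT interpolation follows the min + α·|difference| reading of the pseudocode:

```python
import numpy as np

def dtw_path(a, b):
    """Classic O(len(a)*len(b)) DTW; returns the optimal alignment as a
    list of index pairs (i, j)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, (i, j) = [], (n, m)                  # backtrack from the end
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda p: D[p])
    return path[::-1]

def synth_series(a, b, alpha, length=None):
    """Interpolate a synthetic series between a and b along the DTW
    alignment, then resample to a fixed length N_s (idea of Fig. 2)."""
    pts = []
    for i, j in dtw_path(a, b):
        t_new = min(i, j) + alpha * abs(i - j)              # time index
        v_new = min(a[i], b[j]) + alpha * abs(a[i] - b[j])  # value
        pts.append((t_new, v_new))
    pts.sort()
    t, v = zip(*pts)
    length = length or len(a)
    return np.interp(np.linspace(min(t), max(t), length), t, v)

# Toy usage: two series with the same shape but different amplitude
a = np.sin(np.linspace(0.0, np.pi, 10))
b = 2.0 * a
s = synth_series(a, b, alpha=0.5)
```

Resampling back onto a regular grid keeps all series at the common length N_s, as the problem definition requires.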
Fig. 3. Distribution of TSV (x-axis: TSV from −3 to 3; y-axis: number of samples).
Fig. 4. Distribution of core body temperature (x-axis: core body temperature from 35.5 to 37.5 °C; y-axis: number of samples).
IV. EVALUATION
A. Dataset
The first dataset is a thermal sensation dataset collected
from 21 subjects. The target value of the dataset is Thermal
Sensation Vote (TSV). The subjects reported TSV within
the range of [-3.5, 3.5] by moving the seek bar up and
down on our smartphone application. The scale is called
the American Society of Heating, Refrigerating, and Air-
Conditioning Engineers’ seven-point thermal scale [26], which
is widely used as the metrics of human thermal sensation; the
seven levels range from -3 to +3 (Cold, Cool, Slightly cool,
Neutral, Slightly warm, Warm, Hot). In total, we collected
1686 TSV inputs. The distribution of the TSV inputs is
shown in Fig. 3. Red hatched areas in the figure highlight
the minor distribution. Because most of the data are collected
in an air-conditioned environment, most of the TSVs labeled
by the subjects are +1 (Slightly warm), 0 (Neutral), or -
1 (Slightly cool). Three feature values are heart rate, skin
temperature, and electrodermal activity collected by an E4
wristband sensor [27]. They are measured continuously as
time-series data.
The second dataset is a core body temperature dataset
collected from 13 subjects while exercising. The target value
is the core body temperature recorded by a cosinuss° C-
Temp [28], which can measure tympanic temperature. The
feature values are measured by two physiological sensors and
an environmental sensor.

Fig. 5. TSV estimator (input x of shape (3, 10); LSTM(16) with ReLU, Dropout(0.5), FC(16) with ReLU, Dropout(0.25), FC(1), output y).

The first of the physiological
sensors is WHS-3 [29], which is a wearable heart rate sensor
with a chest strap. We use heart rate and in-cloth temperature
from WHS-3 as feature values. The second one is the E4
wristband sensor as used in the TSV dataset. In addition to
the three feature values used in the TSV dataset, we also use
acceleration from it. Also, we use an environmental sensor to
measure air temperature and relative humidity as time-series.
In total, eight features are input to an estimator. We collected
750 pairs of the core body temperature and ten minutes of the
eight time-series data through the experiment. The distribution
of the core body temperature is shown in Fig. 4.
B. Evaluation Metrics
We evaluate the proposed method with evaluation metrics based on precision P_r, recall R_r, and F-measure F_r for regression problems, as proposed in Ref. [30]. Intuitively, the metrics become larger (i.e., better) when an estimator outputs closer values in rarer cases. The definitions are given below:

P_r = Σ_{φ(ŷ_i) > t_E} φ(ŷ_i) α(ŷ_i, y_i) / Σ_{φ(ŷ_i) > t_E} φ(ŷ_i),   (5)

R_r = Σ_{φ(y_i) > t_E} φ(y_i) α(ŷ_i, y_i) / Σ_{φ(y_i) > t_E} φ(y_i),   (6)

F_r = 2 P_r R_r / (P_r + R_r),   (7)

where y_i and ŷ_i are an actual value and the value estimated by inputting x_i to the estimator, respectively. As defined in Ref. [30], the function α is given by

α(ŷ_i, y_i) = I(L(ŷ_i, y_i) ≤ t_L) (1 − exp(−k (L(ŷ_i, y_i) − t_L)² / t_L²)),   (8)

where I is an indicator function that is one if its argument is true and zero otherwise, t_L is a threshold defining an acceptable error within the domain of a loss function L, e.g., the absolute deviation, and k is a positive number that determines the shape of the function.
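As a concrete reading of Eqs. (5) to (8), the metrics can be computed as in the following NumPy sketch; the toy relevance function phi and all sample values are hypothetical:

```python
import numpy as np

def alpha_acc(y_hat, y, t_L=1.0, k=10.0):
    """Eq. (8): zero outside the acceptable error t_L, approaching one
    as the absolute error shrinks."""
    L = np.abs(y_hat - y)                      # loss: absolute deviation
    return (L <= t_L) * (1.0 - np.exp(-k * (L - t_L) ** 2 / t_L ** 2))

def precision_recall_f(y_hat, y, phi, t_E=0.5, t_L=1.0, k=10.0):
    """P_r, R_r, F_r of Eqs. (5)-(7); phi maps target values to relevance."""
    a = alpha_acc(y_hat, y, t_L, k)
    ph, pt = phi(y_hat), phi(y)
    P = np.sum(ph[ph > t_E] * a[ph > t_E]) / np.sum(ph[ph > t_E])
    R = np.sum(pt[pt > t_E] * a[pt > t_E]) / np.sum(pt[pt > t_E])
    F = 2 * P * R / (P + R) if P + R > 0 else 0.0
    return P, R, F

# Toy example: values with |y| > 2 are treated as rare (hypothetical relevance)
phi = lambda v: (np.abs(v) > 2).astype(float)
y     = np.array([0.0, 0.1, 2.5, -3.0])
y_hat = np.array([0.0, 0.0, 2.4,  0.0])
P, R, F = precision_recall_f(y_hat, y, phi)
```

In this toy case the rare value −3.0 is missed by the estimator, so recall is penalized while precision stays high, matching the intuition described above.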
C. Estimator Design
To estimate the numerical target value y, we construct a
deep learning-based estimator for each dataset using an LSTM
layer. Figures 5 and 6 show estimators for the thermal sensa-
tion dataset and core body temperature dataset, respectively.
The shape of x is (N_f, N_s) as defined in Section III-A. For the TSV estimator, the input x consists of three time-series whose length N_s is ten; for the core body temperature estimator, it consists of eight.

Fig. 6. Core body temperature estimator (input x of shape (8, 10); LSTM(32) with ReLU, Dropout(0.25), FC(32) with ReLU, Dropout(0.25), FC(1), output y).

Fig. 7. Example of time-series generation by the proposed method compared with SMOTER, given two original time-series A and B.
D. Effect of Data Balancing
We evaluate the proposed method through comparison with
baseline, which is without any data balancing, and SMOTER-
based time-series interpolation with Euclidean distance. The
latter method generates a time-series sample by interpolating
a synthetic sample between corresponding samples in original
time-series data as shown in Fig. 7. The figure shows an
example of the time-series generation by each method given
two time-series samples. As illustrated in the figure, SMOTER
fails to inherit features of the original time-series samples such
as the maximum and minimum values. On the other hand,
the proposed method generates a synthetic time-series sample
with the features of the original time-series samples such as
the shape, maximum, and minimum values.
Tables II and III show the evaluation results for the TSV dataset and the core temperature dataset, respectively. We note that in addition to the evaluation metrics defined in Section IV-B, we evaluate the mean absolute error (MAE). We set the parameters for Table II as t_E = 0.5, t_L = 1, and k = 100. For Table III, we set them as t_E = 0.5, t_L = 1, and k = 10. As shown in the tables, the baseline, i.e., an estimator trained on the imbalanced dataset, results in low precision for minor cases. This result is similar to the imbalanced classification problem. The result of F_r shows that the baseline method fails to estimate minor cases within the acceptable error. In addition,
TABLE II
ESTIMATION RESULT FOR TSV DATASET.

Method          | P_r  | R_r  | F_r  | MAE
Baseline        | 0.00 | 0.18 | 0.00 | 0.52
SMOTER [8]      | 0.43 | 0.31 | 0.36 | 0.56
Proposed method | 0.75 | 0.32 | 0.44 | 0.56

TABLE III
ESTIMATION RESULT FOR CORE TEMPERATURE DATASET.

Method          | P_r  | R_r  | F_r  | MAE
Baseline        | 0.35 | 0.51 | 0.40 | 0.38
SMOTER [8]      | 0.45 | 0.50 | 0.44 | 0.38
Proposed method | 0.77 | 0.60 | 0.67 | 0.35
the result of SMOTER with the Euclidean distance shows a smaller improvement than the proposed method in Table II. This is because the time-series generation based on the DTW distance can generate time-series samples close to the original ones. As a result, the proposed method remarkably enhances the estimator's performance for minor cases with only a slight increase in the MAE compared with the baseline. In Table III, the result of the proposed method is superior to that of SMOTER. The proposed method's MAE also improves on the baseline because the estimator can estimate over a larger range than the baseline, which is overfitted to the major values.
V. CONCLUSION
In this study, we proposed a data balancing method for regression datasets that include physiological time-series data. We demonstrated that the estimation accuracy for rare cases improves by balancing datasets that include time-series data measured by physiological sensors. The results indicate that the balancing method helps the effective extraction of physiological time-series features for estimating numerical values. Our future work includes further investigation of the applicable range over the variety of physiological time-series data using more datasets.
REFERENCES
[1] L. Jiang, X. Lin, X. Liu, C. Bi, and G. Xing, “Safedrive: Detecting
distracted driving behaviors using wrist-worn devices, Proc. ACM
Interact. Mob. Wearable Ubiquitous Technol., vol. 1, no. 4, Jan. 2018.
[2] J. Costa, F. Guimbretière, M. F. Jung, and T. Choudhury, “Boostmeup:
Improving cognitive performance in the moment by unobtrusively regu-
lating emotions with a smartwatch,” Proc. ACM Interact. Mob. Wearable
Ubiquitous Technol., vol. 3, no. 2, Jun. 2019.
[3] C. L. Castro and A. P. Braga, “Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 888–899, 2013.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote:
Synthetic minority over-sampling technique, Journal of Artificial Intel-
ligence Research, vol. 16, p. 321–357, June 2002.
[5] J. Johnson and T. Khoshgoftaar, “Survey on deep learning with class
imbalance,” Journal of Big Data, vol. 6, p. 27, 03 2019.
[6] L. Barrios and W. Kleiminger, “The comfstat - automatically sensing
thermal comfort for smart thermostats,” in Proceedings of the 2017 IEEE
International Conference on Pervasive Computing and Communications,
Hawaii, USA, March 2017, pp. 257–266.
[7] W. Hu, Y. Luo, Z. Lu, and Y. Wen, “Heterogeneous transfer learning
for thermal comfort modeling,” in Proceedings of the 6th ACM Inter-
national Conference on Systems for Energy-Efficient Buildings, Cities,
and Transportation, November 2019, p. 61–70.
[8] L. Torgo, R. Ribeiro, B. Pfahringer, and P. Branco, “Smote for re-
gression,” Progress in Artificial Intelligence, vol. 8154, pp. 378–389,
September 2013.
[9] E. A. de la Cal, J. R. Villar, P. M. Vergara, Á. Herrero, and J. Sedano,
“Design issues in time series dataset balancing algorithms,” Neural
Computing and Applications, pp. 1–18, January 2019.
[10] S. I. Nikolenko, “Synthetic data for deep learning,” 2019.
[11] A. Fernández, S. García, F. Herrera, and N. V. Chawla, “Smote for
learning from imbalanced data: Progress and challenges, marking the
15-year anniversary,” Journal of Artificial Intelligence Research, vol. 61,
pp. 863–905, 2018.
[12] H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic
sampling approach for imbalanced learning,” in 2008 IEEE Interna-
tional Joint Conference on Neural Networks (IEEE World Congress on
Computational Intelligence), 2008, pp. 1322–1328.
[13] H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-smote: A new
over-sampling method in imbalanced data sets learning, Advances in
Intelligent Computing, vol. 3644, pp. 878–887, 09 2005.
[14] G. Batista, A. Bazzan, and M.-C. Monard, “Balancing training data for
automated annotation of keywords: a case study.” in Proc. Of Workshop
on Bioinformatics, 01 2003, pp. 10–18.
[15] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of
the behavior of several methods for balancing machine learning training
data,” SIGKDD Explor. Newsl., vol. 6, no. 1, p. 20–29, Jun. 2004.
[16] P. Branco, L. Torgo, and R. P. Ribeiro, “Smogn: a pre-processing
approach for imbalanced regression,” in Proceedings of 1st International
Workshop on Learning with Imbalanced Domains: Theory and Applica-
tions, September 2017.
[17] T. Zhu, C. Luo, J. Li, S. Ren, and Z. Zhang, “Oversampling for
imbalanced time series data,” 2020.
[18] H. GM, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, A compre-
hensive survey and analysis of generative models in machine learning,”
Computer Science Review, vol. 38, p. 100285, 2020.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in Neural Information Processing Systems, vol. 27. Curran
Associates, Inc., 2014, pp. 2672–2680.
[20] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in
Proceedings of 2nd International Conference on Learning Representa-
tions, ICLR 2014, Banff, AB, Canada, April 2014.
[21] Q. Wen, L. Sun, X. Song, J. Gao, X. Wang, and H. Xu, “Time series data
augmentation for deep learning: A survey, ArXiv, vol. abs/2002.12478,
2020.
[22] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S.
Nasrin, M. Hasan, B. C. V. Essen, A. A. S. Awwal, and V. K. Asari,
“A state-of-the-art survey on deep learning theory and architectures,”
Electronics, vol. 8, no. 3, 2019.
[23] P. Branco, L. Torgo, and R. P. Ribeiro, “Pre-processing approaches for imbalanced distributions in regression,” Neurocomputing, vol. 343, pp. 76–99, 2019, Learning in the Presence of Class Imbalance and Concept Drift. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231219301638
[24] L. Torgo and R. Ribeiro, “Utility-based regression, in Knowledge
Discovery in Databases: PKDD 2007. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2007, pp. 597–604.
[25] D. J. Berndt and J. Clifford, “Using dynamic time warping to find pat-
terns in time series,” in Proceedings of the 3rd international conference
on knowledge discovery and data mining, ser. AAAIWS’94. AAAI
Press, July 1994, pp. 359–370.
[26] American Society of Heating, Refrigerating and Air-
Conditioning Engineers, Thermal Environmental Conditions for
human occupancy. ASHRAE Standard 55-2017, 2017. [On-
line]. Available: https://www.ashrae.org/technical-resources/standards-
and-guidelines/read-only-versions-of-ashrae-standards
[27] Empatica, “Real-time physiological signals E4 EDA/GSR sensor,
https://www.empatica.com/en-int/research/e4/, Accessed on April 29,
2021.
[28] cosinuss°, “cosinuss° one performance monitoring,”
https://www.cosinuss.com/en/products/one/, Accessed on April 29,
2021.
[29] UNION TOOL, “Whs-3 wearable heart rate sensor,
https://www.uniontool-mybeat.com/SHOP/8600085.html, Accessed
on April 29, 2021 (in Japanese).
[30] L. Torgo and R. Ribeiro, “Precision and recall for regression,” Discovery Science, vol. 4702, pp. 597–604, 2007.
... This could help to take research findings and solutions to practice. [112] Philosophical Advice Attitudes (users) Partial Mitchell et al (2020) [115] Proposal Procedure Bias Partial Nirav et al (2020) [6] Philosophical Procedure Attitudes (users) Marginal Oppold et al (2020) [121] Proposal Procedure Bias Partial Orr et al (2020) [122] Experience Model Attitudes (practitioners) Partial Paraschakis et al (2020) [126] Proposal Specific solution Bias Partial [197] Proposal Tool Bias Full Zhang, X. et al (2020) [198] Experience Procedure Bias Partial Albach et al (2021) [7] Philosophical Advice Attitudes (users) Partial Bandi et al (2021) [24] Proposal Specific solution Attitudes (users) Marginal Camacho et al (2021) [41] Experience Advice Attitudes (practitioners) Partial Gencoglu (2021) [ [190] Proposal Procedure Attitudes (users) Full Yoshikawa et al (2021) [194] Proposal Specific solution Bias Partial Yu et al (2021) [195] Proposal Specific solution Bias Partial Zicari et al (2021) [200] Proposal Procedure Accountability Partial Aïvodji et al (2021) [19] Proposal Tool Bias Full Blanes-Selva et al (2021) [29] Proposal Specific solution Black Box Marginal [165] Proposal Procedure Bias Partial ...
Article
Ethics of Artificial Intelligence (AI) is a growing research field that has emerged in response to the challenges related to AI. Transparency poses a key challenge for implementing AI ethics in practice. One solution to transparency issues is AI systems that can explain their decisions. Explainable AI (XAI) refers to AI systems that are interpretable or understandable to humans. The research fields of AI ethics and XAI lack a common framework and conceptualization. There is no clarity of the field’s depth and versatility. A systematic approach to understanding the corpus is needed. A systematic review offers an opportunity to detect research gaps and focus points. This paper presents the results of a systematic mapping study (SMS) of the research field of the Ethics of AI. The focus is on understanding the role of XAI and how the topic has been studied empirically. An SMS is a tool for performing a repeatable and continuable literature search. This paper contributes to the research field with a Systematic Map that visualizes what, how, when, and why XAI has been studied empirically in the field of AI ethics. The mapping reveals research gaps in the area. Empirical contributions are drawn from the analysis. The contributions are reflected on in regards to theoretical and practical implications. As the scope of the SMS is a broader research area of AI ethics the collected dataset opens possibilities to continue the mapping process in other directions.
Conference Paper
Full-text available
Deep learning performs remarkably well on many time series analysis tasks recently. The superior performance of deep neural networks relies heavily on a large number of training data to avoid overfitting. However, the labeled data of many real-world time series applications may be limited such as classification in medical time series and anomaly detection in AIOps. As an effective way to enhance the size and quality of the training data, data augmentation is crucial to the successful application of deep learning models on time series data. In this paper, we systematically review different data augmentation methods for time series. We propose a taxonomy for the reviewed methods, and then provide a structured review for these methods by highlighting their strengths and limitations. We also empirically compare different data augmentation methods for different tasks including time series classification, anomaly detection, and forecasting. Finally, we discuss and highlight five future directions to provide useful research guidance.
Article
Full-text available
Generative models have been in existence for many decades. In the field of machine learning, we come across many scenarios when directly learning a target is intractable through discriminative models, and in such cases the joint distribution of the target and the training data is approximated and generated. These generative models help us better represent or model a set of data by generating data in the form of Markov chains or simply employing a generative iterative process to do the same. With the recent innovation of Generative Adversarial Networks (GANs), it is now possible to make use of AI to generate pieces of art, music, etc. with a high extent of realism. In this paper, we review and analyse critically all the generative models, namely Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Latent Dirichlet Allocation (LDA), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), and GANs. We study their algorithms and implement each of the models to provide the reader some insights on which generative model to pick from while dealing with a problem. We also provide some noteworthy contributions done in the past to these models from the literature.
Article
A person's emotional state can strongly influence their ability to achieve optimal task performance. Aiming to help individuals manage their feelings, different emotion regulation technologies have been proposed. However, despite the well-known influence that emotions have on task performance, no study to date has shown whether an emotion regulation technology can also enhance users' cognitive performance in the moment. In this paper, we present BoostMeUp, a smartwatch intervention designed to improve users' cognitive performance by regulating their emotions unobtrusively. Based on studies showing that people tend to perceive external signals that resemble heart rates as their own, the intervention provides personalized haptic feedback simulating a different heart rate. Users can focus on their tasks while the intervention acts upon them in parallel, without requiring any additional action. The intervention was evaluated in an experiment with 72 participants, who had to take math tests under high pressure. Participants exposed to slow haptic feedback during the tests decreased their anxiety, increased their heart rate variability, and performed better in the math tests, while fast haptic feedback led to the opposite effects. These results indicate that the BoostMeUp intervention can lead to positive cognitive, physiological, and behavioral changes.
Article
The purpose of this study is to examine existing deep learning techniques for addressing class-imbalanced data. Effective classification with imbalanced data is an important area of research, as high class imbalance is naturally inherent in many real-world applications, e.g., fraud detection and cancer detection. Moreover, highly imbalanced data poses added difficulty, as most learners will exhibit bias towards the majority class and, in extreme cases, may ignore the minority class altogether. Class imbalance has been studied thoroughly over the last two decades using traditional machine learning models, i.e., non-deep learning. Despite recent advances in deep learning, along with its increasing popularity, very little empirical work on deep learning with class imbalance exists. Given deep learning's record-breaking performance in several complex domains, investigating the use of deep neural networks for problems with high levels of class imbalance is of great interest. Available studies on class imbalance and deep learning are surveyed in order to better understand the efficacy of deep learning when applied to class-imbalanced data. This survey discusses the implementation details and experimental results for each study, and offers additional insight into their strengths and weaknesses. Several areas of focus include: data complexity, architectures tested, performance interpretation, ease of use, big data application, and generalization to other domains. We have found that research in this area is very limited, that most existing work focuses on computer vision tasks with convolutional neural networks, and that the effects of big data are rarely considered. Several traditional methods for class imbalance, e.g., data sampling and cost-sensitive learning, prove to be applicable in deep learning, while more advanced methods that exploit neural network feature learning abilities show promising results. The survey concludes with a discussion that highlights various gaps in deep learning from class-imbalanced data for the purpose of guiding future research.
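Random over-sampling, one of the traditional data-sampling methods mentioned above, simply duplicates minority-class examples until the class counts match. A minimal Python sketch (hypothetical toy data):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    Xo, yo = list(X), list(y)
    for label, c in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - c):
            i = rng.choice(idx)          # resample an existing minority example
            Xo.append(X[i])
            yo.append(label)
    return Xo, yo

# Toy imbalanced dataset: class 1 has 2 samples, class 0 has 4.
X = [[0.1], [0.2], [0.9], [1.0], [1.1], [1.2]]
y = [1, 1, 0, 0, 0, 0]
Xb, yb = random_oversample(X, y)
```

After resampling, both classes contribute equally to the loss, which is the effect cost-sensitive learning achieves by reweighting instead of duplicating.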
Article
In recent years, deep learning has garnered tremendous success in a variety of application domains. This new field of machine learning has been growing rapidly and has been applied to most traditional application domains, as well as some new areas that present more opportunities. Different methods have been proposed based on different categories of learning, including supervised, semi-supervised, and unsupervised learning. Experimental results show state-of-the-art performance using deep learning compared to traditional machine learning approaches in the fields of image processing, computer vision, speech recognition, machine translation, art, medical imaging, medical information processing, robotics and control, bioinformatics, natural language processing, cybersecurity, and many others. This paper presents a brief survey of the advances that have occurred in the area of Deep Learning (DL), starting with the Deep Neural Network (DNN). The survey goes on to cover the Convolutional Neural Network (CNN); the Recurrent Neural Network (RNN), including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU); the Auto-Encoder (AE); the Deep Belief Network (DBN); the Generative Adversarial Network (GAN); and Deep Reinforcement Learning (DRL). Additionally, we discuss recent developments, such as advanced variant DL techniques based on these approaches. This work considers most of the papers published after 2012, when the modern history of deep learning began. Furthermore, DL approaches that have been explored and evaluated in different application domains are also included in this survey, along with recently developed frameworks, SDKs, and benchmark datasets used for implementing and evaluating deep learning approaches. Some surveys have been published on DL using neural networks, as well as a survey on Reinforcement Learning (RL); however, those papers have not discussed individual advanced techniques for training large-scale deep learning models or recently developed generative methods.
Article
Nowadays, the Internet of Things and e-Health are producing huge collections of Time Series (TS) that are analyzed to, among other things, classify current status or detect certain events. In two-class problems, when the positive events to detect are infrequent, the gathered data lack balance. Even in unsupervised learning, this imbalance causes models to lose generalization capability. To solve this problem, TS balancing algorithms have been proposed, but they have barely been studied; the existing approaches use either a single bag of TS, extracting some of them to generate a synthetic new one, or ghost points in the distance space. These solutions are suitable when there is only one data source and the datasets are univariate. However, in the context of the Internet of Things, where multiple data sources are available, these approaches may not perform coherently. Besides, to our knowledge, there is no algorithm in the literature for balancing multivariate TS from multiple data sources. In this research, we study two main concerns that should be considered when designing TS balancing algorithms: on the one hand, they should deal with multiple multivariate data sources; on the other hand, they should be shape preserving. As part of our work, a new algorithm is proposed for balancing multivariate TS datasets. A complete evaluation of the algorithm is performed on two real-world multivariate TS datasets from the e-Health domain: one on epilepsy crisis identification and the other on fall detection. A thorough analysis of the performance is discussed, showing the advantages of considering TS issues within the balancing algorithm.
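One shape-preserving way to synthesize minority time series, in the SMOTE spirit, is a pointwise convex combination of two aligned minority series; every synthetic value stays between its two parents, so the temporal shape is retained. A stdlib-only sketch with hypothetical toy data (not the algorithm from the paper):

```python
import random

def interpolate_series(a, b, alpha):
    """Pointwise convex combination of two aligned multivariate series."""
    return [[(1 - alpha) * xa + alpha * xb for xa, xb in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def synthesize(minority, n_new, seed=0):
    """SMOTE-style synthesis: blend random pairs of minority series."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        out.append(interpolate_series(a, b, rng.random()))
    return out

# Two minority series, 4 time steps x 2 variables each (toy data).
s1 = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2], [0.3, 1.3]]
s2 = [[1.0, 2.0], [1.1, 2.1], [1.2, 2.2], [1.3, 2.3]]
new = synthesize([s1, s2], 3)
```

A multi-source variant would apply the same blend per data source while sharing one interpolation coefficient, so the sources remain mutually coherent.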
Article
Many vital real-world applications involve time-series data with skewed distributions. Compared to traditional imbalanced learning problems, the classification of imbalanced time-series data is more challenging due to high dimensionality and high inter-variable correlation. This paper proposes a structure-preserving Oversampling method for High-dimensional Imbalanced Time-series classification (OHIT). OHIT leverages a density-ratio-based shared-nearest-neighbor clustering algorithm to capture the modes of the minority class in high-dimensional space. For each mode, it applies a shrinkage technique for large-dimensional covariance matrices to obtain an accurate and reliable covariance structure. Structure-preserving synthetic samples are then generated from the multivariate Gaussian distribution with the estimated covariance matrix. In addition, to further improve the performance of classifying imbalanced time-series data, we integrate OHIT into a boosting framework to obtain a new ensemble algorithm, OHITBoost. Extensive experiments on several publicly available time-series datasets (both unimodal and multimodal) demonstrate the effectiveness of OHIT and OHITBoost in terms of F1, G-mean, and AUC.
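The core generation step, sampling from a Gaussian fitted to the minority mode so that inter-variable correlation is preserved, can be illustrated in two dimensions with a hand-rolled Cholesky factor. A simplified stdlib-only sketch (toy data; OHIT's shrinkage and clustering steps are omitted):

```python
import math
import random

def estimate_gaussian(points):
    """Sample mean and covariance of 2-D minority points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (a, b, c)

def sample_gaussian(mean, cov, n, seed=0):
    """Draw from N(mean, cov) via a 2x2 Cholesky factor, preserving correlation."""
    rng = random.Random(seed)
    (mx, my), (a, b, c) = mean, cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(max(c - l21 ** 2, 1e-12))  # guard against round-off
    out = []
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mx + l11 * g1, my + l21 * g1 + l22 * g2))
    return out

minority = [(1.0, 2.0), (1.2, 2.3), (0.8, 1.9), (1.1, 2.2), (0.9, 2.1)]
mean, cov = estimate_gaussian(minority)
synthetic = sample_gaussian(mean, cov, 50)
```

In high dimensions the sample covariance becomes unreliable, which is exactly why OHIT adds the shrinkage estimator before sampling.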
Book
This is the first book on synthetic data for deep learning, and its breadth of coverage may make it the default reference on synthetic data for years to come. The book can also serve as an introduction to several other important subfields of machine learning that are seldom touched upon in other books. Machine learning as a discipline would not be possible without the inner workings of optimization at hand. The book includes the necessary sinews of optimization, though the crux of the discussion centers on the increasingly popular tool for training deep learning models, namely synthetic data. It is expected that the field of synthetic data will undergo exponential growth in the near future, and this book serves as a comprehensive survey of the field. In the simplest case, synthetic data refers to computer-generated graphics used to train computer vision models, but there are many more facets to consider. In the section on basic computer vision, the book discusses fundamental computer vision problems, both low-level (e.g., optical flow estimation) and high-level (e.g., object detection and semantic segmentation); synthetic environments and datasets for outdoor and urban scenes (autonomous driving), indoor scenes (indoor navigation), and aerial navigation; and simulation environments for robotics. Additionally, it touches upon applications of synthetic data outside computer vision (in neural programming, bioinformatics, NLP, and more). It also surveys work on improving synthetic data development and alternative ways to produce it, such as GANs. The book introduces and reviews several different approaches to synthetic data in various domains of machine learning, most notably domain adaptation, for making synthetic data more realistic and/or adapting models to be trained on synthetic data, and differential privacy, for generating synthetic data with privacy guarantees. This discussion is accompanied by an introduction to generative adversarial networks (GANs) and to differential privacy.
Conference Paper
For decades, the Predicted Mean Vote (PMV) model has been adopted to evaluate building occupants' thermal comfort. However, recent studies argue that the PMV model is inaccurate and suffers from two major issues: thermal comfort parameter inadequacy and modeling data inadequacy. To overcome these issues, in this paper we propose a learning-based approach for thermal comfort modeling, named Heterogeneous Transfer Learning (HTL) based Intelligent Thermal Comfort Neural Network (HTL-ITCNN). First, to address the parameter inadequacy issue, we add more relevant factors as modeling features in addition to the six PMV parameters. Due to the flexibility of learning-based approaches, newly found thermal comfort parameters can be appended to extend the feature set. Second, to mitigate the impact of the data inadequacy issue, we adopt deep transfer learning techniques to train the thermal comfort model, so that training benefits from knowledge transferred from existing datasets. Due to the heterogeneity of features among different datasets, we follow the HTL concept to conduct effective knowledge transfer among heterogeneous domains, i.e., different but related datasets with varied features. To validate our solution, we conduct five-month data collection experiments and build our datasets. With the HTL-based two-stage learning paradigm, the experimental results show that the accuracy of HTL-ITCNN outperforms the PMV model by 73.9% on average. Besides, we verify the impacts of the newly added features and of knowledge transfer on model performance, and we demonstrate the enormous potential of personal thermal comfort modeling research.
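The essence of the two-stage transfer paradigm, pretrain on a large related dataset, then fine-tune on scarce target data, can be shown with a one-feature linear model and gradient descent. A stdlib-only sketch with hypothetical toy data (not the HTL-ITCNN architecture):

```python
def fit_sgd(data, w=0.0, b=0.0, lr=0.01, epochs=200):
    """Least-squares fit of y ~ w*x + b by per-sample gradient descent, from a given init."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(data, w, b):
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Large "source" dataset (a related comfort relationship) and a tiny "target" set.
source = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(100)]   # y = 2x + 1
target = [(0.0, 1.5), (0.5, 2.6), (1.0, 3.7)]                   # shifted relationship

w0, b0 = fit_sgd(source)                          # stage 1: pretrain on source
w1, b1 = fit_sgd(target, w=w0, b=b0, epochs=50)   # stage 2: fine-tune on target
```

Starting stage 2 from the source weights rather than zero is the transfer step; with only three target samples, the warm start supplies the structure the target data cannot.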
Article
Imbalanced domains are an important problem frequently arising in real-world predictive analytics. A significant body of research has addressed imbalanced distributions in classification tasks, where the target variable is nominal. In regression tasks, where the target variable is continuous, imbalanced distributions of the target variable also raise several challenges for learning algorithms. Imbalanced domains are characterized by: (1) a higher relevance being assigned to the performance on a subset of the target variable values; and (2) these most relevant values being underrepresented in the available data set. Recently, some proposals were made to address the problem of imbalanced distributions in regression, but this remains a scarcely explored issue with few existing solutions. This paper describes three new approaches for tackling imbalanced distributions in regression tasks. We propose the adaptation of random over-sampling and the introduction of Gaussian noise to regression tasks, and we present a new method called the WEighted Relevance-based Combination Strategy (WERCS). An extensive set of experiments provides empirical evidence of the advantage of using the proposed strategies and, in particular, the WERCS method. We analyze the impact of different data characteristics on the performance of the methods. A data repository with 15 imbalanced regression data sets is also provided to the research community.
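For regression, over-sampling must target rare *ranges* of a continuous variable rather than a minority class; the Gaussian-noise variant replicates rare-target samples with small perturbations on both features and target. A minimal stdlib-only sketch (hypothetical names and toy data, with a simple threshold standing in for a relevance function):

```python
import random

def gn_oversample(X, y, rare, n_new, sigma=0.05, seed=0):
    """Replicate rare-target samples with small Gaussian noise on features and target.
    `rare` is a predicate marking under-represented target values."""
    rng = random.Random(seed)
    rare_idx = [i for i, t in enumerate(y) if rare(t)]
    Xo, yo = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(rare_idx)
        Xo.append([v + rng.gauss(0, sigma) for v in X[i]])  # perturbed features
        yo.append(y[i] + rng.gauss(0, sigma))               # perturbed target
    return Xo, yo

# Toy regression set: targets above 2.0 are rare (one sample out of five).
X = [[0.1], [0.2], [0.3], [0.4], [2.0]]
y = [0.5, 0.6, 0.4, 0.5, 2.5]
Xb, yb = gn_oversample(X, y, rare=lambda t: t > 2.0, n_new=3)
```

Unlike plain duplication, the noise spreads the synthetic points around the rare example, which reduces the risk of the regressor simply memorizing it.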