Occupancy Forecasting using two ARIMA Strategies


We present an occupancy forecasting method for the smart home context, based on environmental measurements such as CO2, sound or relative humidity. This article presents our machine learning algorithm and prediction strategy, which rely on two levels of data exploitation. The first level is supervised learning, which recovers past occupancy from sensor measurements by means of a multiple logistic regression algorithm. The second level consists of two main steps. In the first step, an ARIMA model is trained on the past occupancy data from level 1. In the second step, ARIMA predicts the future occupancy. The innovative part of our paper is the comparison of two different (de-seasonalised) ARIMA strategies. The first is the "day-sequence-time-series" (a serial ARIMA). The second is the "daily-slice-time-series" (a parallel ARIMA). We conclude by analyzing the performance of our occupancy prediction paradigm.
1 Introduction
The context of our study is energy efficiency. In recent years, energy efficiency has been pursued mainly by working on the insulation of the building envelope, a strategy that has now reached optimal levels of energy performance. Additional gains are therefore to be sought in optimal thermal regulation. The strategy is to continuously adapt the comfort conditions to the living situation. To do this, it is necessary to automatically characterise the activity of the occupants in the building. In today's technological design for smart buildings, the key problem we face is understanding the consumer's behaviour. Our occupancy forecast strategy will, in the future, allow for energy savings in a smart building context. The control/command strategy of the heater will be presented in an upcoming paper. In this article, we address the principle of our occupancy prediction method.
The occupancy forecasting method presented in this paper makes one main contribution: we compute two original ARIMA strategies for the forecast of occupancy. The first is a "Day Sequence" Time Series, which is a common process, and the second is a "Daily Time Slice" Time Series, which is an unusual process. The second ARIMA consists in forecasting the probability of occupancy of just one time slice (30 minutes). Then, with a loop, we reconstitute a full day by assembling the results of all the time slices. We present a comparative analysis of our two ARIMA models against several criteria (error, reliability, temporal consistency, etc.). Finally, we propose conclusions and perspectives for using our prediction algorithms in an intelligent regulation paradigm in the context of energy saving.
2 Related works
Characterising human activity and being able to predict it is a major issue in
many disciplinary fields. Many methods have already been proposed in
the medical field (such as personal assistance), in the energy efficiency field and in
many others. In [1], a complete monitoring architecture is presented, including home
sensors and cloud-based back-end services. In this article, supervised techniques for
behavioural data analysis are proposed using regression methods and ARIMA. By
means of inductive and deductive reasoning, the authors of the article [2] introduce a
framework to detect occupant activity and potentially wasted energy consumption. This
framework consists of three sub-algorithms for action detection, activity recognition
and waste estimation. Unsupervised clustering models are used to detect the actions
that occurred. In paper [3] a new approach to modeling human behaviour patterns is
suggested. The authors use Markov chains to determine an unsupervised model of
human behaviour and to detect the deviation over time. Deviating behaviour is revealed
through data clustering and analysis of associations between clusters and data vectors
representing adjacent time intervals. The activity recognition is also used in [4], which
proposes learning customized structural models for common user activities in order to
predict the trend of energy consumption. The recognition algorithm is based on
recursive structures of user activities obtained from raw sensor readings. Artificial
neural networks (ANN) are used in [5] [6] [7] to manage resident activity recognition
in Smart Homes. The authors in [5] tackle three ANN algorithms for human activity
recognition, namely: Quick Propagation (QP), Levenberg Marquardt (LM) and Batch
Back Propagation (BBP). In the same way, an unsupervised learning strategy is used in
[8] to improve activity recognition in smart environments. In [9] and [10] the Support
Vector Machines (SVM) are used to address the same problem.
3 Theoretical framework
3.1 The Multiple Variables Logistic Regression (MLR):
Logistic Regression is a statistical learning algorithm developed by David Cox in
1958. Its purpose is to reconstruct a qualitative variable Y as a function of one (simple
regression) or several (multiple regression) explanatory variables $X_1, \dots, X_n$. A
discussion on logistic regression (and variants) can be found in detail in the book by
Hastie et al. [11]. The main idea is to express certain log-odds as linear functions of the
Xi, using equations similar to classical linear regression.
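When Y is binary, it suffices to define $p(x) = P(Y = 1 \mid X = x)$ and to assume
that its log-odds is a linear function of the explanatory variables:
$$\log \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$ (Eq. 1)
where the coefficients $\beta_0, \beta_1, \dots, \beta_n$ are parameters to be estimated.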
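The MLR finds estimates for the parameters $\beta = (\beta_0, \dots, \beta_n)$ by maximizing the
log-likelihood function $\ell(\beta)$ with the Newton-Raphson iterative method (the solution
has no closed form): at each step, the estimates are updated by
$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left( \frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}$$ (Eq. 2)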
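Once we have estimated $\hat{\beta}$, we obtain an estimated probability function:
$$\hat{p}(x) = \frac{1}{1 + \exp\left(-(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_n x_n)\right)}$$ (Eq. 3)
This $\hat{p}(x)$ is calculated at each instant of measurement, giving the posterior
probability of occupation as a function of time, as mentioned at the end of § 3.1.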
Its interpretation is as follows: when it is close to 1, the measurements indicate
that the occupant is present; when it is close to 0, the measurements indicate that
the occupant is absent; when it is close to 0.5, the measurements could be
associated with either presence or absence.
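As an illustration of Eq. 1 to Eq. 3, the following minimal Python sketch fits a logistic regression by Newton-Raphson using only the standard library. It is not the implementation used in this study: the function names (`solve_linear`, `fit_logistic`, `predict_proba`), the fixed number of iterations and the toy CO2 data are assumptions made for the example. The Hessian of the log-likelihood equals minus the matrix accumulated in `info`, so subtracting the inverse Hessian times the gradient (Eq. 2) becomes an addition in the code.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, n_iter=25):
    """Newton-Raphson estimate of (beta_0, ..., beta_n); an intercept is added."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k                      # gradient of the log-likelihood
        info = [[0.0] * k for _ in range(k)]  # minus the Hessian (X^T W X)
        for x, y in zip(xs, labels):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    info[i][j] += w * x[i] * x[j]
        step = solve_linear(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

def predict_proba(beta, row):
    """Estimated occupancy probability (Eq. 3) for one measurement vector."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))

# Toy data (assumed): CO2 in thousands of ppm, PIR label 0/1, not separable.
rows = [[0.4], [0.5], [0.6], [0.7], [0.8], [0.9], [1.0], [1.1]]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
beta = fit_logistic(rows, labels)
p_low, p_high = predict_proba(beta, [0.4]), predict_proba(beta, [1.1])
```

The toy labels deliberately overlap: with perfectly separable data the maximum-likelihood estimate does not exist and the iterations would diverge.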
3.2 Pre-processing with STL
The STL method decomposes a time series into the sum of three components:
seasonal, trend, and residual (or remainder) using Loess (non-linear regression
technique) [15]. An STL decomposition of our data is shown in Figure 1 below. Here,
the seasons correspond to days. We call de-seasonalised data the residual component.
It will be handed to several ARIMA strategies (§ 4.1). Finally, we will add the trend
and the seasonal components back to the ARIMA results to obtain occupancy
probability forecasts: we will call this operation re-seasonalising the data.
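The de-seasonalise / re-seasonalise round trip can be sketched in a few lines of Python. This sketch does not implement STL: as a loudly simplified stand-in, the trend is taken as the global mean and the seasonal component as the mean deviation at each position within the period, with no Loess smoothing. The function names and the toy series are assumptions for the example.

```python
def deseasonalise(series, period):
    """Additive split into (trend, seasonal, residual).

    Simplified stand-in for STL: constant trend (global mean) and
    per-position seasonal means; no Loess smoothing is performed.
    """
    n = len(series)
    trend = sum(series) / n
    seasonal = []
    for pos in range(period):
        vals = [series[t] - trend for t in range(pos, n, period)]
        seasonal.append(sum(vals) / len(vals))
    residual = [series[t] - trend - seasonal[t % period] for t in range(n)]
    return trend, seasonal, residual

def reseasonalise(residual_forecast, trend, seasonal, start):
    """Add trend and seasonal components back to forecast residuals.

    start is the index, in the original series, of the first forecast step.
    """
    period = len(seasonal)
    return [r + trend + seasonal[(start + h) % period]
            for h, r in enumerate(residual_forecast)]

# Toy series (assumed) with a period of 2 instead of 48 half-hour slices.
series = [0.1, 0.8, 0.2, 0.9, 0.1, 0.7]
trend, seasonal, residual = deseasonalise(series, period=2)

# Re-seasonalising the residuals themselves recovers the original series.
rebuilt = reseasonalise(residual, trend, seasonal, start=0)

# A forecast model would supply the residuals for the next period; zeros are
# used here as a placeholder, giving trend + seasonal as the forecast.
next_period = reseasonalise([0.0, 0.0], trend, seasonal, start=len(series))
```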
An ARIMA (Auto Regressive Integrated Moving Average) [12] model is a
statistical model for analyzing and forecasting time series data. Adopting an ARIMA
model for a time series assumes that the underlying process that generated the
observations is an ARIMA process, which implies stationarity [13]: the data are expected
to follow the same general trends and patterns as in the past [14]. This may seem obvious,
but it motivates the need to check the assumptions of the model both in the raw
observations and in the residual errors of the forecasts from the model.
4 Proposed Method
We propose an occupancy prediction method in a smart home context based on the
exploitation of the measurements of the sensors disseminated in the building. Our
paradigm is based on two consecutive steps integrating the learning process:
- to determine occupancy probability (MLR) based on sensor data
- to forecast occupancy in the near future (STL-ARIMA)
4.1 Forecasting step using ARIMA class model by de-seasonalising data with 2 strategies
Strategy 1 (Serial): “Day Sequence” Time Series Processes Model
This model handles the time series in a classical way: the probabilities of occupancy
form a single sequence treated by ARIMA. Here, we assume that whole days will
follow the same general trends and patterns as in the past. However, we separate the
weekdays from the weekends, and work independently on the two resulting samples (in
one, Fridays are followed by Mondays, and in the other one, Sundays are followed by
Saturdays). We implement our new “weekday” database and the two types of seasonal
variables in the Day Sequence Time Series process (database=[2016,1] (weekday and
30 minute step phases)) and we forecast one day ahead (48 steps of 30 minutes each)
with the STL-ARIMA function. The STL ARIMA can be written as:
(Eq. 4)
with a daily season made of sub-season slices of 30 minutes each, and the following terms:
- errors
- j: number of step-ahead forecasts
- benchmark time
- level
- trend
- seasonal component
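The weekday/weekend separation used by Strategy 1 can be illustrated with a short Python sketch. The function name and the constant placeholder values are assumptions; only the dates (1 January to 28 February 2017) and the 30-minute step come from the text. With these dates, the weekday sample contains 42 days of 48 slices, that is 2016 points, which matches the database size [2016,1] quoted above.

```python
from datetime import datetime, timedelta

def split_weekdays_weekends(timestamps, values):
    """Return (weekday_values, weekend_values), each kept in time order.

    In the weekday sequence a Friday is directly followed by a Monday;
    in the weekend sequence a Sunday is directly followed by a Saturday.
    """
    weekday, weekend = [], []
    for ts, v in zip(timestamps, values):
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        (weekday if ts.weekday() < 5 else weekend).append(v)
    return weekday, weekend

SLICES_PER_DAY = 48                      # 30-minute slices
start = datetime(2017, 1, 1)             # first day of the dataset
n_days = 31 + 28                         # January and February 2017
timestamps = [start + timedelta(minutes=30 * k)
              for k in range(n_days * SLICES_PER_DAY)]
values = [0.5] * len(timestamps)         # placeholder occupancy probabilities

weekday, weekend = split_weekdays_weekends(timestamps, values)
n_weekdays = len(weekday) // SLICES_PER_DAY
```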
Strategy 2 (Parallel): “Daily Time Slice” Time Series Processes Model.
This model handles the time series in an innovative way: we define 48 time slices
per day (each 30 minutes long) and then form a sample for each time slice. We designed
this model to take advantage of the regularity per time slices on multiple days (the
occupant's “habits”). Hence, 48 instances of ARIMA are performed on shorter
sequences than in Strategy 1. For instance, one ARIMA handles only the probabilities
of presence for the time-slice 8:00 to 8:30 am, each data point coming from a different
day. Therefore, we use the same database as in Strategy 1, converted into a probability
matrix [42x48] that corresponds to 42 days and 48 slices of time per day. This strategy
can be seen as 48 “parallel” ARIMAs, whereas Strategy 1 consists of 48 “serial”
ARIMAs. We also forecast one day ahead with the STL-ARIMA function, but here it
just corresponds to one step in time (one day) for each of the 48 slices (30 minutes).
This can be written as:
(Eq. 5)
with a weekday season, and the following terms:
- errors
- i: index of the time slice
- benchmark time
- level
- trend
- seasonal component
4.2 Implementation algorithm
To determine occupancy (the variable of interest), we use data from Netatmo(c) and
infrared sensors disseminated in the environment: we have relative humidity (Hr%),
CO2 (ppm), and infrared measurements (PIR 0/1) to determine whether or not the
occupant is present. We obtain the probability of occupancy by supervised learning,
fitting PIR as a function of the others with multiple logistic regression (MLR). Then,
we aim to compare and/or combine two forecast algorithms based on ARIMA models,
differing by the strategy for reassembling the time samples: the “Day Sequence” time
series (48 serial ARIMAs) and the “Daily Time Slice” time series (48 parallel ARIMAs).
The prediction data are reorganised (split) in order to feed both ARIMA strategies
(§4.1). At the end of the process, the parallel ARIMA forecasts are merged together in
order to obtain an entire day. The serial ARIMA provides a day forecast directly. Figure
3 illustrates the sequence of computations involved in this method.
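The split-and-merge step of the parallel strategy can be sketched as follows in Python. The series is reshaped into a days-by-slices matrix, each column is forecast one step ahead, and the 48 results are merged into one day. The forecaster `naive_one_step` is a hypothetical placeholder (mean of the last five days) standing in for a fitted ARIMA model; all function names are assumptions for the example.

```python
def to_day_matrix(series, slices_per_day=48):
    """Reshape a flat series into a list of days, each with slices_per_day values."""
    if len(series) % slices_per_day:
        raise ValueError("series length must be a multiple of slices_per_day")
    return [series[d:d + slices_per_day]
            for d in range(0, len(series), slices_per_day)]

def slice_series(matrix, i):
    """Series of slice i across days (one value per day)."""
    return [day[i] for day in matrix]

def naive_one_step(history):
    """Hypothetical placeholder for a fitted ARIMA: mean of the last 5 values."""
    tail = history[-5:]
    return sum(tail) / len(tail)

def forecast_next_day(series, forecaster=naive_one_step, slices_per_day=48):
    """One-step forecast per time slice, merged into a full day."""
    matrix = to_day_matrix(series, slices_per_day)
    return [forecaster(slice_series(matrix, i)) for i in range(slices_per_day)]

# Toy data (assumed): 42 identical days of 48 half-hour probabilities.
pattern = [i / 47 for i in range(48)]
series = pattern * 42
matrix = to_day_matrix(series)
next_day = forecast_next_day(series)
```

Passing the forecaster as an argument keeps the reshaping logic independent of the model, so that any one-step predictor can be substituted for the placeholder.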
5 Results and Discussion
5.1 The raw data in the Learning phase
Our perception system is composed of four sources. For each source, the sampling
rate of the raw data is 5 minutes. The input of the data is almost synchronous. The
sensors’ data include room temperature (°C), CO2 levels (ppm), relative humidity
(% Hr) and passive infrared (PIR, 0/1), as shown in Figure 4. This dataset covers
the period stretching from 1 January to 28 February 2017. We consider that this time
range is sufficiently long to evaluate the occupancy behaviour.
The raw dataset is used to train and test a classification in order to determine
occupancy probability. The PIR data is only used for the MLR (Multiple Variable
Logistic Regression) classification, to supervise (training) and to control the estimation
(testing). The purpose of this classification is to replace all raw data by a new dataset
that represents the occupancy probability as a function of time.
5.2 The time series data (Occupancy probability)
Figure 5 shows the time series data produced by the
Learning phase (MLR).
Because we are using past data to predict future data, we must assume that the
data will follow the same general trends and patterns as in the past. This general
assumption underlies most training and modelling. However, the rolling mean and
standard deviation appear to change over time, so some de-trending and removal of
seasonality are needed. Applying a log transformation and first-order
differencing makes the data more stationary over time. This makes the data suitable
for use in our ARIMA models.
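The log transformation and first-order differencing mentioned above, together with their inverse, can be sketched in Python as follows. The function names and the floor `eps`, which avoids taking the logarithm of a zero probability, are assumptions for the example and are not taken from the study.

```python
import math

def log_diff(series, eps=1e-6):
    """Log-transform, then take first-order differences.

    Returns the first log value (needed to invert) and the differences.
    Values below eps are clipped so that the logarithm is defined.
    """
    logs = [math.log(max(v, eps)) for v in series]
    diffs = [b - a for a, b in zip(logs, logs[1:])]
    return logs[0], diffs

def undo_log_diff(first_log, diffs):
    """Invert log_diff: cumulative sum of the differences, then exponential."""
    logs = [first_log]
    for d in diffs:
        logs.append(logs[-1] + d)
    return [math.exp(v) for v in logs]

# Toy occupancy probabilities (assumed).
series = [0.2, 0.4, 0.8, 0.4]
first_log, diffs = log_diff(series)
restored = undo_log_diff(first_log, diffs)
```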
5.3 Forecasting Probabilities
In Figure 6 below, the dataset covers the 01/01/17-28/02/17 period, and we forecast
the next day’s half-hourly results (48 steps) with the 2 strategies described in §
4.1 (grey and orange curves). To assess the accuracy of the forecast, we use as reference
the output of the MLR classification of a known day (01/03/2017, blue curve). All
values are re-seasonalised.
The STL-ARIMA forecasting with the “Day Sequence” process (Serial) tends to
overestimate the occupancy probability and is smoother. The STL-ARIMA forecasting
with the “Daily Time Slice” process (Parallel) is jagged: sometimes it overestimates,
sometimes it underestimates. Both manage to anticipate the rise of the occupancy
probability, but a little too soon (34 units instead of 39 time units, so about 2h30 in real time).
The difference in smoothness is not very surprising, since the Serial strategy
corresponds to a single autoregressive model whereas the Parallel strategy corresponds
to 48 intertwined models: in the Serial, the auto-regression equation uses truly
successive data (previous half hours); in the Parallel, each forecast point is obtained as
a function of more distant points in time (previous days).
In Figure 7 below, we report only the Serial strategy forecast (previous grey curve),
with the associated confidence interval. In Figure 8 below, we report only the Parallel
strategy forecast (previous orange curve), with the associated confidence interval.
Overall, all confidence intervals are quite narrow, which means that our forecast
horizon (1 day) is suitable. Perhaps more surprisingly, the sizes of these intervals are
very similar with both strategies. The Serial strategy learns with more data (2 long
subsamples) but its forecast horizon is farther (48 time units); the Parallel strategy
learns with fewer data (48 short subsamples) but its forecast horizon is nearer (1 time
unit). It seems that the influences of these two factors (sample size, time horizon)
counterbalance each other.
In Figure 9 below, we report several statistical indicators that aim to assess the
performance of our two strategies on the test day (01/03/2017). We also report the
performance of a more naïve ARIMA without the preliminary STL step (Auto-
Most indicators are similar among the three methods. Some notable exceptions are
the MAE (mean absolute error), the MPE (mean percentage error) and the MAPE (mean
absolute percentage error). Both our strategies perform better than the naïve one
according to the MAE, and our Parallel method performs better than the others
according to the MPE and the MAPE. This suggests that our innovative strategy has real
merits and deserves further attention.
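For reference, the three indicators discussed above can be computed as in the following Python sketch. The sign convention (error = actual minus forecast) and the expression of MPE and MAPE in percent are assumptions, since the text does not state them; the toy values are also assumed. Both percentage errors are undefined when an actual value is zero.

```python
def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mpe(actual, forecast):
    """Mean percentage error, in percent (sign shows the direction of bias)."""
    return 100.0 * sum((a - f) / a for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Toy values (assumed): reference probabilities and a forecast.
actual = [0.5, 0.25, 1.0]
forecast = [0.25, 0.5, 1.0]
scores = (mae(actual, forecast), mpe(actual, forecast), mape(actual, forecast))
```

With this convention, a negative MPE indicates that the forecast overestimates on average, whereas the MAE and the MAPE ignore the direction of the error.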
6 Conclusion and Perspective
In this article, we have proposed to deal with occupancy forecasting in a Smart
Building context. Occupancy forecasting allows a smart control of HVAC devices in
order to save energy and optimise comfort. We presented a forecasting strategy of
occupancy mainly based on four steps. First, from direct data measurements (CO2, PIR,
Hr, Temp), we define an occupancy probability based on MLR classification. Then, we
remove the seasonal component of the time series of occupancy through the STL
method. The third step predicts the temporal signal (occupancy) with two ARIMA
strategies: one is the “Day Sequence” (Serial) and the other is the “Daily Time Slice”
(Parallel). Finally, we add the seasonal component back.
The attentive reader will have noticed that in Figure 7, the forecasts are negative
between 8 and 11 units of time. Such values cannot represent a probability; they arise
because ARIMA works with unconstrained real values. We plan to solve this
problem by using the ARIMA strategies on the log-odds instead of the probabilities.
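The planned change of variable can be sketched in Python as follows: probabilities are mapped to log-odds before forecasting, and forecasts are mapped back afterwards, which guarantees values between 0 and 1. The function names and the clipping constant `eps`, which keeps the log-odds finite when a probability is exactly 0 or 1, are assumptions for the example.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped to [eps, 1 - eps] to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Any real-valued forecast in log-odds space maps back to a valid probability.
log_odds_forecast = [-5.0, -0.5, 0.0, 2.0]   # assumed model output
probabilities = [inv_logit(z) for z in log_odds_forecast]
round_trip = inv_logit(logit(0.3))
```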
Overall, both ARIMA strategies give suitable results with low uncertainties. The
“Daily Time Slice” forecast is more dynamic than the “Day Sequence” one, but has
similar uncertainties, at least for a 1-day horizon. It is well known that as the forecast
horizon increases, the size of the confidence intervals tends to grow. Our two strategies
might differ in how fast this growth occurs. This question will be addressed
in future work.
