Mining electrical meter data to predict principal building use,
performance class, and operations strategy for hundreds of
non-residential buildings
Clayton Miller^{a,b,*}, Forrest Meggers^{c}
^a Building and Urban Data Science (BUDS) Group, Department of Building, National University of Singapore, 117566 Singapore
^b Institute of Technology in Architecture (ITA), Architecture and Building Systems (A/S), ETH Zürich, 8093 Zürich, Switzerland
^c Cooling and Heating for Architecturally Optimized Systems (CHAOS) Lab, Andlinger Center for Energy and Environment, Dept. of Architecture, Princeton University, Princeton, NJ, 08544, USA
Abstract
This study focuses on the inference of characteristic data from a data set of 507 non-residential
buildings. A two-step framework is presented that extracts statistical, model-based, and pattern-
based behavior. The goal of the framework is to reduce the expert intervention needed to utilize
measured raw data in order to infer information such as building use type, performance class,
and operational behavior. The first step is temporal feature extraction, which utilizes a library
of data mining techniques to filter various phenomena from the raw data. This step transforms
quantitative raw data into qualitative categories that are presented in heat map visualizations for
interpretation. In the second step, a random forest classification model is tested for accuracy in
predicting primary space use, magnitude of energy consumption, and type of operational strategy
using the generated features. The results show that predictions with these methods are 45.6%
more accurate for primary building use type, 24.3% more accurate for performance class, and
63.6% more accurate for building operations type as compared to baselines.
Keywords: Data mining, Building performance, Performance classification, Energy efficiency,
Smart meters
1. Introduction
The built and urban environments have a significant impact on resource consumption and
greenhouse gas emissions in the world. The United States is the world’s second-largest en-
ergy consumer, and buildings there account for 41% of energy consumed^1. The most extensive
meta-analysis thus far of non-residential existing buildings showed a median opportunity of 16%
energy savings potential by using cost-effective measures to remedy performance deficiencies
[1]. Simply stated, roughly 6% of the energy consumed in the U.S. could be easily mitigated - a
* Corresponding author. Phone: +65 81602452
Email address: clayton@nus.edu.sg (Clayton Miller)
^1 As of 2014, according to http://www.eia.gov/
Preprint submitted to Energy and Buildings October 6, 2017
figure that would eventually grow to an annual energy savings potential of $30 billion and 340
megatons of CO2 by the year 2030. Beyond saving energy and money and mitigating carbon, the
impact of building performance improvement also extends to the health, comfort and satisfaction
of the people who use buildings.
It is puzzling that these performance improvements are not being rapidly identified and implemented on a massive scale across the world's building stock, given the incentives and the amount of research focused on building optimization in the fields of Architecture, Engineering, and Computer Science. A comprehensive study of building performance analysis was completed by the
California Commissioning Collaborative (CACx) to characterize the technology, market, and
research landscape in the United States. Three of the key tasks in this project focused on estab-
lishing the state of the art [2], characterizing available tools and the barriers to adoption [3], and
developing standard performance metrics [4]. These reports were accomplished through investi-
gation of the available tools and technologies on the market as well as discussions and surveys
with building operators and engineers. The common theme amongst the interviews and case
studies was the lack of time and expertise on the part of the dedicated operations professionals.
The findings showed that installation time and cost was driven by the need for an engineer to de-
velop a full understanding of the building and systems. These barriers reduce the implementation
of performance improvements.
From these studies, it becomes apparent that the biggest barrier to achieving performance
improvement in buildings is scalability. Architecture is a discipline founded with aesthetic cre-
ativity as a core tenet. Frank Lloyd Wright once stated, “The mother art is architecture. Without
an architecture of our own, we have no soul of our civilization.” Designers rightfully strive for
artistic and meaningful creations; this phenomenon results in buildings with not only distinctive
aesthetics but also unique energy systems design, installation practices and different levels of or-
ganization within the data-creating components. This paper shows that an emerging mass of data
from the built environment can facilitate better characterization of buildings through automation
of meta-data extraction. These data are temporal sensor measurements from performance
measurement systems.
1.1. Growth of Raw Temporal Data Sources in the Built Environment
As entities of analysis, buildings are less like typical mass-produced manufactured devices, in which each unit is identical in its components and functionality, and more like the customers of a business: entities that are similar and yet have many nuances. Conventional
mechanistic or model-based approaches, typically borrowed from manufacturing, have been the
status quo in building performance research. As previously discussed, scalability among the
heterogeneous building stock is a significant barrier to these approaches. More appropriate means
of analysis lies in statistical learning techniques more often found in the medical, pharmaceutical
and customer acquisition domains. These methods rely on extracting information and correlating
patterns from large empirical data sets. The strength of these techniques is in their robustness and
automation of implementation - concepts explicitly necessary to meet the challenges outlined.
This type of research on buildings would have been difficult even a few years ago. The creation
and consolidation of measured sensor sources from the built environment and its occupants is oc-
curring on an unprecedented scale. The Green Button Ecosystem now enables the easy extraction
of performance data from over 60 million buildings^2. Advanced metering infrastructure (AMI),
^2 According to http://www.greenbuttondata.org/
or smart meters, have been installed on over 58.5 million buildings in the US alone^3. A recent
press release from the White House summarizes the impact of utilities and cities in unlocking
these data [5]. It announces that 18 power utilities, serving more than 2.6 million customers, will
provide detailed energy data by 2017. The release also suggests that such accessibility will enable
improvement of energy performance in buildings by 20% by 2020. A vast majority of these raw
data being generated are sub-hourly temporal data from meters and sensors.
1.2. Previous work
A significant amount of work has been undertaken in the field of building characterization
using measured meter data. A comprehensive review of unsupervised learning techniques for
various portfolio analysis and smart meter data was recently completed that includes much of
the previous work in this area [6]. The key studies in the field of building characterization often
deal with segmentation of large numbers of buildings, usually within the realm of smart meter
analytics. Customer segmentation has been studied using various extracted temporal features
from smart meter data for targeting programs [7, 8, 9, 10]. Feature-based clustering of time-series performance data from buildings is another key field that precedes the current work. This field seeks to group various types of buildings or meters into similar clusters for analysis [11,
12, 13, 14, 15, 16, 17, 18]. Various studies have looked at classification of buildings with various
objectives using temporal meter data as a source of features [19, 20, 21, 16, 22]. Several other
studies have extracted temporal features that enhance the ability to forecast consumption [23, 24,
25]. Several studies have analyzed larger than usual datasets from devices such as water heaters
[26] and retrofit analysis at the city scale [27].
1.3. A Framework for Automated Characterization of Large Numbers of Non-Residential Buildings
This paper discusses a framework to investigate which characteristics of whole building elec-
trical meter data are most indicative of various meta-data about buildings among large collections
of commercial buildings. This structure is designed to screen electrical meter data for insight on
the path towards deeper data analysis. The screening nature of the process is motivated by the
scalability challenges previously outlined. An initial component of the methodology was a se-
ries of case study interviews and data collection processes to survey field data from numerous
buildings around the world. A significant portion of this work was completed as part of a Ph.D.
dissertation entitled "Screening Meter Data: Characterization of Temporal Energy Data from
Large Groups of Non-Residential Buildings" [28].
The contributions of this study are related to its development and testing of a library of tem-
poral machine learning features within the domain of non-residential buildings. To the authors'
best knowledge, no previous study has taken such a large number of buildings (507) and applied
temporal feature engineering approaches from such a wide range of sources. Temporal features
are extracted using techniques such as Seasonal Decomposition of Time Series by Loess (STL)
and Symbolic Aggregate approXimation (SAX) using Vector Space Models (VSM) that have
never been applied to electrical meter data from buildings. This study is also unique in that the
objective is prediction of meta-data about buildings. This target is related to the contemporary
challenge of large, raw temporal datasets from thousands of buildings with a significant amount
of missing information; such is the case with large campuses, portfolios and utility-scale smart
meter implementations.
^3 As of 2014, according to http://www.eia.gov/tools/faqs/faq.cfm?id=108&t=3
2. Methodology
A two-step process is presented as a means of extracting knowledge from whole building
electrical meters. Figure 1 illustrates the intermediate steps in each of the phases. The first
step is to create temporal features that produce quantitative data to describe various phenomena
occurring in the raw temporal data. This action is intended to transform the data into a more
human-interpretable format and visualize the general patterns in the data. In this step, the data
are extracted, cleaned, and processed with a library of temporal feature extraction techniques
to differentiate various types of behavior. These features are visualized using an aggregate heat
map format that can be evaluated according to expert intuition, comparison with design intent
metrics, or with outlier detection.
The second step is focused on the characterization of buildings using the temporal features
according to several objectives. This process allows an analyst to understand the impact each
feature has upon the discrimination of each objective. Three test objectives are implemented in
this study: principal building use, performance class, and operations strategy. One of the key
outputs of this supervised learning process is the detection and discussion of what input features
are most important in predicting the various classes. This approach gives exploratory insight
into what features are important in determining various characteristics of a particular building
amongst a large set of its peers. These metadata are building blocks for many other techniques
such as benchmarking, diagnostics and targeting. The motivation for choosing these particular
objectives centers around the consistently available meta-data from the collected case study data
and their relation to various other techniques in the building performance analysis domain.
Figure 1: Overview of data analytics framework
2.1. Case study buildings and collected data
An open data set of 507 whole-building electrical meters was utilized in this study for implementation of the two-step process. These buildings are from university campuses from around
the world. The origin and development of these data are found in Miller and Meggers [29].
A broad range of descriptive statistics and meta-data explanation are available in the previous
literature.
2.2. Temporal feature extraction
Feature extraction is an essential process of machine learning and is the means by which objects are described quantitatively in a way that algorithms can differentiate between different types or classes. Many of these data are needed when creating an energy simulation model, when setting thresholds for automated fault detection and diagnostics, or when benchmarking a building. When performing analysis on a single building, this meta-data might be easy to accumulate. However, when such a process is scaled across hundreds or potentially thousands of buildings, collection of these data is not a trivial procedure.
The goal of temporal feature extraction and analysis is to use various techniques to convert
all these qualitative terms into a quantitative domain. For example, the descriptor weather-dependency can be quantified through the Spearman rank order correlation coefficient with outdoor air temperature. Consistency or volatility of daily, weekly, or annual
behavior can be quantified using various pattern recognition techniques. The primary focus of
this study is to create and apply some temporal feature extraction techniques on commercial
buildings for characterization. Figure 2 illustrates a hierarchy of the conventional categories of
temporal features and the new category of temporal features that include a few examples that are
outlined in this study.
Figure 2: Temporal features extracted solely from raw sensor data
Temporal features are aggregations of the behavior exhibited in time-series data. They are
characteristics that summarize sensor information in a way to inform an analyst through visu-
alization or to use as training data in a predictive classification or regression model. Feature
extraction is a step in the process of machine learning and is a form of dimensionality reduction
of data. This process seeks to quantify various qualitative behaviors. This section provides an
overview of the categories of temporal features extracted from the case study building data, the methods used to implement them, and visualized examples of how a selected subset of features manifests over a time range. Table 1 gives an overview of the temporal features outlined in
this section.
Feature Category         General Description
Statistics-based         Aggregations of time series data using mean, median, max, min, and standard deviation
Regression model-based   Development of a predictive model using training data, then using the model parameters and outputs to describe the data
Pattern-based            Extraction of frequent and useful daily, weekly, monthly, or long-term patterns

Table 1: Overview of feature categories
2.3. Characterization and prediction of meta-data
The primary goal of this research is to get a better sense of what behavior in time-series sensor
data is most characteristic of various types of buildings. As mentioned in the introduction, if
this meta-data can be discriminated, the process of characterizing a building can be automated.
This section describes the use of random forest classification models and their input variable
importance feature. An overview of this process is found in Figure 3.
For each objective, two steps are taken: first to predict the objective and then to investigate the influence of the input features on class differentiation. In the first step, a random forest classification model is built using subsets of the generated features to predict the objective's class. In the second phase, the classification model's accuracy indicates the ability of the temporal features to describe the class.
Figure 3: Characterization process to investigate the ability of various features to describe the classification objectives
Random forest classification models were chosen based on their capacity to model diverse
and large data sets in a robust way [30]. These models use an ensemble of decision trees to
predict various characteristic labels about each building based on its features. The literature
describes decision trees as the "closest to meeting the requirements for serving as an off-the-shelf
procedure for data mining" [31]. Decision trees often over-fit data due to high variance. Random
forest models work by creating a set of decision trees and averaging all of their predictions to
overcome this variance.
Random forests use a form of cross-validation by training and testing each tree using a dier-
ent bootstrapped sample from the data. This process produces an out-of-bag error (OOB) that
acts as a generalized error for understanding how well each class can be predicted. This accuracy
is used to determine how well the generated temporal features can delineate the class objectives.
Random forests can also calculate the importance of the input features and how well they lend themselves to predicting the targets. This attribute is useful in that it allows us to understand exactly which temporal features are most characteristic of various objectives. Variable importance is calculated using Equation 1: the importance of an input feature X_m for predicting Y is found by adding up the weighted impurity decreases p(t)Δi(s_t, t) for all nodes t where X_m is used, averaged over all N_T trees in the forest [32].

Imp(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t)\,\Delta i(s_t, t) \qquad (1)
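As a minimal sketch, scikit-learn's RandomForestClassifier exposes both the OOB accuracy and the impurity-based importances of Equation 1 directly; the synthetic features and two-class labels below are illustrative placeholders, not the study's data:

```python
# Sketch: random forest with out-of-bag (OOB) accuracy and mean-decrease-in-
# impurity importances (scikit-learn's feature_importances_ implements the
# tree-averaged importance of Equation 1). All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                 # e.g. 10 temporal features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # e.g. a two-class objective

model = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,    # generalization estimate from the bootstrap hold-outs
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {model.oob_score_:.2f}")
ranked = np.argsort(model.feature_importances_)[::-1]
print("most important features:", ranked[:3])
```

The OOB score stands in for the generalized error described above, and the importance ranking identifies which temporal features discriminate the classes.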
3. Statistics-based features
Statistics-based temporal features are the first and most simplified category of temporal fea-
tures developed. The main classes of features are basic temporal statistics, ratio-based features,
and the Spearman rank order correlation coefficient.
3.1. Basic statistics
The first set of temporal features to be extracted are basic statistics-based metrics that utilize
the time-series data vector for various time ranges to obtain information using mean, median,
maximum, minimum, range, variance, and standard deviation. Many of these features are devel-
oped through the implementation of the VISDOM package in the R programming language [33].
As a simple example, if a time-series vector is described as X, with N values X = x_1, x_2, ..., x_N, the most common statistical metric, the mean (μ), can be calculated using Equation 2 [34].

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (2)
The mean is taken not just for the entire time series, but also from the summer and winter
seasons. The variance of the values is taken for the whole year, the summer and winter seasons as
well. The variance of daily mean, minimum, and maximum values are determined to understand
the breadth of values across the time range. Variance is calculated according to Equation 3.
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \qquad (3)
The maximum and minimum electrical demand are calculated. Additionally, the hour and date
at which the maximum demand occurs are determined to understand when peak consumption
occurs. The 97th and 3rd percentiles are calculated to exclude any extreme outliers, values
that are often more useful than the maximum and minimum.
A series of hour-of-day (HOD) metrics are calculated that aggregate the behavior occurring at each hour of the day. The first of these captures the most common hour of peak demand on the 10% hottest days and the most common hour of the top 10% of temperatures, which informs roughly about cooling energy consumption. These metrics are repeated for the bottom 10% coldest days and temperatures. Another set of twenty-four parameters is calculated to account directly for the mean demand of each hour of the day.
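These statistics can be sketched in a few lines of pandas; the synthetic daytime-peaking meter series below is an assumption of the example, not a case study building:

```python
# Sketch: hour-of-day (HOD) mean-demand features and robust percentile
# statistics with pandas, computed on a synthetic hourly meter series.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
rng = np.random.default_rng(0)
office_hours = (idx.hour >= 8) & (idx.hour < 18)
meter = pd.Series(50 + 30 * office_hours + rng.normal(0, 2, len(idx)), index=idx)

hod_mean = meter.groupby(meter.index.hour).mean()   # the 24 HOD parameters
peak_hour = int(meter.idxmax().hour)                # hour of the annual peak
p97, p03 = meter.quantile([0.97, 0.03])             # outlier-robust max/min
print(hod_mean.round(1).head(3))
```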
3.2. Ratio-based statistical features
The second major category of statistical features is ratio-based features. Simply, these are
metrics in which two or more of the previously calculated statistical parameters are combined
as a ratio. These features often have a normalizing effect in which buildings can be more appropriately compared to each other. The first extracted metric of this type is one of the most
commonly calculated for building performance analysis: the consumption magnitude of elec-
tricity normalized by the floor area of the building. This metric seeks to provide a basis for
comparison between buildings and is used as a key metric within numerous benchmarking and
performance analysis techniques.
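A minimal sketch of two such ratio features follows; the synthetic meter series and the floor area are made-up illustrative values:

```python
# Sketch: two ratio-based features -- area-normalized annual consumption and
# the mean daily max-to-min load ratio -- on a synthetic hourly meter.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
rng = np.random.default_rng(5)
daytime = (idx.hour >= 8) & (idx.hour < 18)
meter = pd.Series(80 + 40 * daytime + rng.normal(0, 3, len(idx)), index=idx)  # kW

floor_area_m2 = 5_000                               # assumed gross floor area
eui = meter.sum() / floor_area_m2                   # kWh per m^2 per year
daily = meter.resample("D")
max_min_ratio = float((daily.max() / daily.min()).mean())
print(f"EUI: {eui:.0f} kWh/m2/yr, daily max/min ratio: {max_min_ratio:.2f}")
```

Dividing by floor area puts buildings of different sizes on a common scale, while the daily max/min ratio summarizes how strongly the load modulates each day.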
3.3. Spearman rank order correlation coefficient
Another useful metric relates to how much influence outside air temperature has on the consumption of a building. Miller et al. describe a process of utilizing the Spearman Rank Order Correlation (ROC) coefficient to approximate the correlation between outside conditions and electrical consumption [35]. The ROC essentially ranks the items in two different lists, and the coefficient quantifies whether those lists are correlated positively or negatively. In this case, the two variables are outside air temperature and electrical consumption. The coefficient ranges from -1 (highly negatively correlated) to +1 (highly positively correlated). If the correlation is positive, the ROC is closer to +1 and the electrical consumption is cooling sensitive, as consumption goes up with higher temperature. If the correlation is negative, the ROC is closer to -1, and the time range is heating sensitive, as consumption goes up with lower temperature.
The correlation coefficient can be visualized for a single case as seen in Figure 4. The factor, in this instance, is calculated individually for each month. This process results in twelve calculations of the metric, each using 29 to 31 samples. In this case, consumption from January to May is noticeably more heating sensitive, a fact that can be observed clearly from the line chart as well as the one-dimensional heat map. May to November is more cooling sensitive. It is interesting that September appears to be the most cooling sensitive month, a fact perhaps related to the schedules in use during that month. This coefficient is not a perfect indicator of HVAC consumption; it only detects a correlation. However, it is fast and easy to calculate and is the first phase of detecting weather dependency.
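The monthly calculation can be sketched with scipy; the synthetic cooling-driven building below is a placeholder for a real meter and weather feed:

```python
# Sketch: monthly Spearman rank-order correlation between outdoor temperature
# and electrical load. Positive values suggest cooling sensitivity, negative
# values heating sensitivity. All data are synthetic.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

idx = pd.date_range("2015-01-01", periods=365, freq="D")
rng = np.random.default_rng(1)
doy = idx.dayofyear.to_numpy()
temp = 10 + 12 * np.sin(2 * np.pi * (doy - 100) / 365) + rng.normal(0, 2, 365)
load = 500 + 8 * np.clip(temp - 15, 0, None) + rng.normal(0, 10, 365)

df = pd.DataFrame({"temp": temp, "load": load}, index=idx)
# One coefficient per month, each from that month's ~30 daily samples
monthly_roc = df.groupby(df.index.month).apply(
    lambda m: spearmanr(m["temp"], m["load"])[0]
)
print(monthly_roc.round(2))
```

For this synthetic building, summer months come out strongly cooling sensitive while winter months hover near zero, mirroring the single-building example above.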
3.4. Implementation of stats-based features
Figure 5 illustrates three examples of the screening parameters applied to all of the case study buildings: area-normalized consumption, the ratio-based daily load max vs. min, and the monthly Spearman rank order correlation coefficient. There are five segments of buildings based on the primary use types within the set: offices, university laboratories, college classrooms, primary/secondary schools, and university dormitories. These
metrics are visualized in this way to understand the difference between each of these use types.
Each row of the heat map for each segment shows the values of
Figure 4: Single building example of the Spearman rank order correlation coefficient with weather
the feature for a single building, while the x-axis is the time range for all buildings. Not all of
the case study buildings have a January to December time range. For these cases, the data were rearranged so that a continuous January to December series is available to visualize in the heat map. The aggregation metrics themselves are not calculated with this rearranged vector; it is used only for visualization purposes.
4. Regression model-based features
Semi-physical behavior of a building can be extracted by using performance prediction models and taking their output parameters and goodness-of-fit metrics for characterization. This section covers the use of several common electrical consumption prediction models to create sets of
temporal features useful for characterization of buildings.
Prediction of electrical loads based on their shape and trends over time is a mature field de-
veloped to forecast consumption, detect anomalies, and analyze the impact of demand response
and efficiency measures. The most common technique in this category is the use of heating
and cooling degree days to normalize monthly consumption [36]. Over the years, various other
methods have been developed using techniques such as neural networks, ARIMA models, and
more complex regression [37]. However, simplified methods have retained their usefulness over
time due to ease of implementation and accuracy. In the context of temporal feature creation,
a regression model provides various metrics that describe how well a meter conforms to con-
ventional assumptions. For example, if actual measurements and predicted consumption match
well, the underlying behavior of energy-consuming systems in the building has been captured
adequately. If not, there is an uncharacteristic phenomenon that will need to be captured with a
different type of model or feature.
Figure 5: Heat map of a selection of statistics-based temporal features: area-normalized consumption (left), ratio-based daily load max vs. min (center), and monthly Spearman rank order correlation coefficient (right)
4.1. Load shape regression-based features
A contemporary, simplified load prediction technique is selected to create temporal features
that capture whether the electrical measurement is simply a function of time-of-week scheduling.
This model was developed by Mathieu et al. and by Price and implemented mostly in the context of electrical demand response evaluation [38, 39]. The premise of the model is based on two features: a time-of-week indicator and an outdoor air temperature dependence. This model is also known as the Time-of-Week and Temperature (TOWT) model or LBNL regression model and is implemented in the eetd-loadshape library developed by Lawrence Berkeley National Laboratory^4.
According to the literature, the model operates as follows [38]. The time of week indicator
is created by dividing each week into a set of intervals corresponding to each hour of the week.
For example, the first interval is Sunday at 01:00, the second is Sunday at 02:00, and so on. The
last, or 168th, interval is Saturday at 23:00. A different regression coefficient, α_i, is calculated for each interval in addition to the temperature dependence. The model uses outdoor air temperature dependence to divide the intervals into two categories: one for occupied hours and one for unoccupied hours. These modes are not necessarily indicators of exactly when people are inhabiting the building, but merely an empirical indication of when occupancy-related systems are detected to be operating. Separate piecewise-continuous temperature dependencies are then calculated for
^4 https://bitbucket.org/berkeleylab/eetd-loadshape
each type of mode. The outdoor air temperature is divided into six equally sized temperature intervals. A temperature parameter, β_j, with j = 1...6, is assigned to each interval. Within the model, the outdoor air temperature at time t, occurring at time-of-week i (designated as T(t_i)), is divided into six component temperatures, T_{c,j}(t_i). Each of these temperatures is multiplied by β_j and then summed to determine the temperature-dependent load. For occupied periods the building load, L_o, is calculated by Equation 4.

L_o(t_i, T(t_i)) = \alpha_i + \sum_{j=1}^{6} \beta_j T_{c,j}(t_i) \qquad (4)
Prediction in unoccupied mode uses a single temperature parameter, β_u. The unoccupied load, L_u, is calculated with Equation 5.

L_u(t_i, T(t_i)) = \alpha_i + \beta_u T(t_i) \qquad (5)
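A simplified TOWT-style fit can be sketched with ordinary least squares: 168 time-of-week dummies stand in for the α_i terms and six binned component temperatures for the β_j terms. This is a stand-in for, not a reproduction of, the eetd-loadshape implementation; the occupied/unoccupied split is omitted and all data are synthetic:

```python
# Sketch of a TOWT-style regression: one dummy per hour-of-week (alpha_i)
# plus six component temperatures (beta_j), fitted with least squares.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=24 * 7 * 8, freq="h")  # 8 weeks
rng = np.random.default_rng(2)
hour, dow = idx.hour.to_numpy(), idx.dayofweek.to_numpy()
temp = 15 + 10 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, len(idx))
occ = (hour >= 8) & (hour < 18) & (dow < 5)
load = 40 + 60 * occ + 2.0 * np.clip(temp - 18, 0, None) + rng.normal(0, 3, len(idx))

tow = dow * 24 + hour                           # time-of-week index, 0..167
X_tow = np.eye(168)[tow]                        # one dummy column per interval
edges = np.linspace(temp.min(), temp.max(), 7)  # six equal temperature bins
# Component temperatures T_{c,j}: how far T reaches into each bin
X_temp = np.clip(temp[:, None] - edges[:-1][None, :], 0, np.diff(edges)[None, :])

X = np.hstack([X_tow, X_temp])
coef, *_ = np.linalg.lstsq(X, load, rcond=None)
pred = X @ coef
resid = (load - pred) / load.max()              # normalized residual (cf. Eq. 6)
rmse = float(np.sqrt(np.mean((load - pred) ** 2)))
print(f"in-sample RMSE: {rmse:.2f} kW")
```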
The primary means of temporal feature creation from this process is through the analysis of
model fit. The first metric calculated is a normalized, hourly residual, R, that can be used to visualize deviations from the model. It is calculated from the actual load, L_a, and the predicted load, L_p. The residual at a particular hour, t, is calculated using Equation 6.

R_t = \frac{L_{t,a} - L_{t,p}}{\max L_a} \qquad (6)
An example of the TOWT model implemented on one of the case study buildings is seen in
Figure 6. Two primary characteristics are captured from a model residual analysis. The first is deviation from the set time-of-week schedule, behavior that causes the model to highly over-predict. These deviations are most often attributed to public holidays, breaks in normal operation, or changes in normal operating modes. In the single building study, one of the most visible daily deviations, Christmas Day, is observed. This day is significantly over-predicted because the model is not informed of the Christmas Day holiday. The automated capture of this phenomenon can indicate whether the building is of a certain use-type or in a particular jurisdiction. The second characteristic obtained is periods of underprediction, when the building is consuming more electricity than expected. These data inform whether a building is being consistently utilized, or whether there is volatility in its normal operating schedule from week-to-week.
4.2. Change point model regression
Another means of performance modeling that considers weather characterization is the use of
linear change point models. The outputs of these models are interpretable in approximating
the amount of energy being used for heating, ventilation, and air-conditioning (HVAC). This type
of model has its basis in the previously-mentioned PRISM method and has been continuously
utilized, recently by Kissock and Eger [40]. This multivariate, piece-wise regression model is
developed using daily consumption and outdoor air dry-bulb temperature information. A linear
regression model is fitted to data detected to be correlated with outdoor dry-bulb air temperature,
either positively for cooling energy consumption or negatively for heating energy consumption.
For example, as the outdoor air temperature climbs above a certain point, the relationship be-
tween electricity consumption and every degree increase in temperature should be a straight line
with a certain slope if the building has an electrically-driven cooling system. The point at which
Figure 6: Single building example of TOWT model with hourly normalized residuals
this change occurs is considered the cooling balance point of the building, and the slope of the
line is the rate of cooling energy increase due to outdoor air conditions.
Equations 7 and 8 are used to predict energy consumption based on an outdoor air temperature, T. These equations approximate the cooling (β2(T − β3)) and heating (β2(β3 − T)) components of the electrical consumption to a certain level of accuracy.

Ec = β1 + β2(T − β3)    (7)

Eh = β1 + β2(β3 − T)    (8)
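As a concrete illustration, a change point model of the cooling form in Equation 7 can be fitted with ordinary nonlinear least squares. The sketch below uses SciPy on synthetic data; the data, starting values, and the max() formulation of the balance point are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import curve_fit

def cooling_change_point(T, b1, b2, b3):
    # E = b1 + b2 * max(T - b3, 0): flat base load below the balance
    # point b3, linear cooling response of slope b2 above it
    return b1 + b2 * np.maximum(T - b3, 0.0)

# Synthetic daily data: 100 kWh base load, 18 C balance point,
# 5 kWh per degree of cooling response, plus noise
rng = np.random.default_rng(0)
T = rng.uniform(0.0, 35.0, 365)
E = cooling_change_point(T, 100.0, 5.0, 18.0) + rng.normal(0.0, 2.0, 365)

(b1, b2, b3), _ = curve_fit(cooling_change_point, T, E, p0=[90.0, 2.0, 15.0])
# b1 ~ base load, b2 ~ cooling slope, b3 ~ cooling balance point
```

With a year of daily observations, the recovered balance point b3 approximates the temperature at which cooling energy begins to appear in the meter data.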
4.3. Seasonality and trend decomposition
Temporal data from different sources often exhibit similar types of behavior that are studied within the field of forecasting and temporal data mining. Electrical building meter data fits into this category, and the same feature extraction techniques can be applied as what is commonly done for financial or social science analysis. These techniques often seek to decompose time-series data into several components that represent the underlying nature of the data [34]. For example, the electrical meter data collected from buildings is often cyclical in its weekly schedule. People are utilizing buildings each day of the week in a relatively predictable pattern. A prevalent example of this behavior is found in office buildings where occupants are typically white-collar professionals who come into work on weekdays at a particular time and leave to go home at a certain time. Weekends are unoccupied periods in which there is little to no activity.
This behavior is an example of what’s known as seasonality within time series analysis. Season-
ality is a fixed and known period of consistent modulation and is a feature that is often extracted
before creating predictive models.
Trends are another feature commonly found in temporal data. A trend is a long-term increase
or decrease in the data that often doesn’t follow a particular pattern. Trends are commonly due
to factors that are less systematic than seasonality and are often due to external influences. For
building energy consumption, trends manifest themselves as gradual shifts in consumption over
the course of weeks or months. Often these variations are due to weather-related factors influencing the HVAC equipment. Other causes of trends are changes in occupancy or degradation of system efficiency.
To capture these features to understand their impact on characterizing buildings, the seasonal-
trend decomposition procedure based on loess is used to extract each of these features from
the case study buildings [41]. This process is used to remove the weekly seasonal patterns
from each building, the long-term trend over time, and the residual remainders from the model
developed by those two components. The input data is aggregated to daily summations and
weather-normalized by subtracting the calculated heating and cooling elements from the change
point model described in Section 4.2. This step is done to reduce the influence weather plays in
the trend decomposition. The STL package in R is used for this process to extract the seasonal,
trend, and irregular components5.
The details of the internal algorithms of the STL procedure are described by Cleveland et al. [41]. The process uses an inner loop of algorithms to detrend and deseasonalize the data by creating a trend component, Tv, and a seasonal component, Sv. The remainder component, Rv, is the result of subtracting both from the input values, Yv, as seen in Equation 9.

Rv = Yv − Tv − Sv    (9)
4.4. Implementation of model-based features
Figure 7 illustrates an overview of an implementation of three examples of model-based features on all the buildings across the various building use types in the study. The heat map at the far left illustrates normalized residuals from the load shape regression model. The differences between each use type can be noticed from a high level due to the nature of the residuals. The darker areas of the visualization indicate when the model is highly over-predicting consumption and lighter areas indicate when the model is under-predicting. Typical holiday periods such as spring, summer and winter breaks and holidays such as the American Labor Day and Thanksgiving are seen as darker areas. Offices, labs and classrooms seem to have similar residual patterns, likely due to their scheduling being similar. Slight fundamental differences are seen, such as the fact that classrooms have more general areas of over-prediction, likely due to less consistent occupancy. Primary/Secondary schools and dormitories are less predictable on an annual basis due to their strong seasonal patterns of use; this fact is intuitive, and model residuals of this type are accurate in automatically characterizing this behavior. The center figure illustrates heating energy regression for all case study buildings. These figures have been normalized according to floor area. Each building's response to outdoor air temperature is indicative of the type of systems installed in addition to the efficiency of energy conversion of those systems. The far right heat map illustrates the trend decomposition as applied to the entire case study set of buildings. Offices appear to have quite a bit of diversity over time, with a few observable systematic low spots in the spring and autumn periods at the bottom of the heat map. Laboratories reflect that behavior, while university classrooms visually have an opposite effect, with a less-than-average trend in the summer months. Primary/Secondary school classrooms have a very distinct delineation between when school is in session and out of session during the summer and various breaks. As many of these schools are in the UK, their out-of-session periods appear to line up naturally. University dormitories also have clear delineations between occupied and unoccupied periods, and they also seem to match up quite well, despite the diversity of data sources of these buildings.
5https://stat.ethz.ch/R-manual/R-devel/library/stats/html/stl.html
Figure 7: Heat map of a selection of model-based temporal features: Daily normalized residuals from load shape regres-
sion models (left), heating energy prediction using change point model regression (center), and seasonal trends using the
STL package (right)
5. Pattern-based features
The third category of temporal features developed in this study is related to capturing the typical and atypical patterns of use from building performance data. The goal of these features is to quantify whether a building has consistency on a daily or weekly basis, whether certain building types exhibit particular patterns of use, and how these patterns can be used to predict various kinds of meta-data. In temporal feature mining, two concepts are relevant in this analysis: motifs and discords. A motif is a typical pattern that occurs on a regular basis within a data set [42]. A discord is an unusual pattern within a data set that identifies infrequent behavior [43]. Several of the temporal features developed in this process are designed to leverage these concepts. The pattern-based feature categories outlined in this section include diurnal pattern extraction, pattern specificity, and long-term consistency.
5.1. Diurnal pattern extraction
The first temporal feature outlined is based on the DayFilter process, which extracts the motifs and discords from raw meter data based on 24-hour periods [44]. This process heavily utilizes
the Symbolic Aggregate approXimation (SAX) representation of time-series data [45]. SAX is
a process of time-series data discretization that converts temporal data into the string data type.
This process empowers various text mining and visualization techniques. The primary feature
extracted from this process for this study is diurnal pattern frequency which quantifies the number
and size of motifs found from a particular meter.
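A minimal sketch of the SAX discretization itself (z-normalize, piecewise-aggregate, then map segment means to letters via Gaussian breakpoints) illustrates the representation that DayFilter builds on; the segment count, alphabet, and example profile below are illustrative choices, not the published parameters:

```python
import numpy as np

def sax_word(series, n_segments=4, alphabet="abcd"):
    """Convert a time series into a SAX word."""
    x = (series - series.mean()) / series.std()      # z-normalize
    paa = x.reshape(n_segments, -1).mean(axis=1)     # piecewise aggregate approximation
    breakpoints = np.array([-0.67, 0.0, 0.67])       # N(0,1) quartile breakpoints
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

# One day of hourly load: low overnight, high during working hours
day = np.array([20.0] * 6 + [60.0] * 2 + [90.0] * 8 + [60.0] * 2 + [20.0] * 6)
word = sax_word(day)  # a low-high-high-low daily shape such as "adda"
```

Days that share a SAX word form a motif candidate; rare words point at discords.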
5.2. Pattern specificity
Another method to leverage SAX to characterize the case study data is to use it to extract which patterns are most indicative of a particular building use type. This information is obtained using the SAX-VSM process pioneered by Senin and Malinchik, which combines SAX with the Vector Space Model technique from the text mining field [46]. Conventionally this method is utilized as a classification model to predict the class to which a certain time-series belongs. A by-product of the process is that the subsequences of each data stream are assigned a metric indicating their specificity. Pattern specificity is a concept that quantifies how well a meter fits within its class. This technique is used to determine whether a building is operating similarly to other supposed peer buildings of the same type.
5.3. Long-term pattern consistency
The concept of long-term consistency is related to how volatile a building's electrical consumption is over the course of a long-range period such as one year. A building that is considered more volatile will have significant shifts in steady-state operation over the course of a year. Often these changes are related to seasonality of scheduling, as can be the case in buildings like schools and universities. A less volatile building will be more consistent in overall magnitude of consumption over the course of a year. This behavior is more often the case in offices and laboratories. In this analysis, a concept known as breakout detection is utilized to quantify the difference between these behaviors. A metric is created to detect the number of shifts in relative steady-state over the course of the time range. This metric was developed in a previous study focused on data from a single campus [35]. An R programming package, BreakoutDetection, is used to create this parameter. This package was developed by the social media company Twitter to process their time-series data6. The details of the algorithms utilized in this package can be found in a study by James et al. [47].
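The breakout count can be approximated with a simple mean-shift scan; the sketch below is a simplified stand-in for the E-Divisive-with-Medians algorithm in the BreakoutDetection package, with an illustrative threshold and synthetic data:

```python
import numpy as np

def count_breakouts(x, min_size=30, threshold=3.0):
    """Count shifts in relative steady-state: flag a breakout when the
    mean of the next `min_size` window differs from the mean of the
    current segment by more than `threshold` within-segment standard
    deviations, enforcing `min_size` samples between breakouts."""
    x = np.asarray(x, dtype=float)
    count, start, i = 0, 0, min_size
    while i <= len(x) - min_size:
        left, right = x[start:i], x[i:i + min_size]
        within_sd = np.sqrt((left.var() + right.var()) / 2.0) + 1e-9
        if abs(left.mean() - right.mean()) > threshold * within_sd:
            count += 1
            start = i
            i += min_size  # enforce the minimum segment length
        else:
            i += 1
    return count

# Synthetic daily series with three level shifts (occupied/unoccupied)
rng = np.random.default_rng(42)
levels = np.concatenate([np.full(90, 100.0), np.full(60, 60.0),
                         np.full(120, 100.0), np.full(95, 60.0)])
n_shifts = count_breakouts(levels + rng.normal(0.0, 5.0, levels.size))
```

The resulting count serves as the volatility feature: a schedule-driven building such as a dormitory accumulates several shifts per year, while a consistently operated office accumulates few.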
Figure 8 illustrates the breakout detection process for a single building data stream. This dataset includes hourly data from an entire year for a university dormitory building. A minimum threshold of 30 days is chosen in this case, which explains the lack of a detected shift in April, a break that may be attributed to spring break for this building. Seven total steady-state variations were detected by the algorithm in this case, and many of them occur during the conventional university scheduled summer, spring, fall and winter breaks.
6https://github.com/twitter/BreakoutDetection
Figure 8: Single building example of breakout detection to test for long-term volatility in a university dormitory building.
5.4. Implementation of pattern-based features
Figure 9 shows three pattern-based features as applied to all the case study buildings. The far left heat map shows the pattern frequency metric extracted from the DayFilter process as applied to all the case study buildings. One will notice that there is a range of pattern frequencies occurring across each of the building use types. Offices and Primary/Secondary Classrooms seem to have larger regions of darker, more consistent behavior. Labs and classrooms seem to be more volatile across the time ranges. The center heat map illustrates breakout detection across the building use types in this study. This implementation uses the same input parameter of a 30-day minimum between breakouts. One notices some consistency among offices, labs, and classrooms regarding the distribution of breakout numbers, while university dormitories and primary/secondary classrooms have a noticeably higher number of breakouts across the range of behavior. The far right heat map illustrates the daily specificity calculation process applied to all 507 case studies as divided among the use types. Clear differences in patterns across the time ranges are visible for each of the building use types. Offices, university laboratories, and university classrooms all seem to have similar phases of specificity at similar times of the year, while their breaks often differentiate dorms and primary/secondary schools.
6. Prediction of building use, performance class, and operational strategy
Visualization of temporal features on their own is a means of understanding the range of values of the various phenomena across a time range. This situation gives an analyst the basis to begin understanding what discriminates a building based on different objectives. The next step is to utilize the features to predict whether a building falls into a particular category and test the importance of various elements in making that prediction. Understanding which features are most characteristic of a particular objective is the fundamental tenet of this study. In this section, three classification objectives are tested:
1. Principal Building Use - The primary use of the building is designated according to the principal activity conducted, by percentage of space designated for that activity. It is rare for a building
Figure 9: Heat map of a selection of pattern-based temporal features: daily pattern frequency from the DayFilter process
(left), breakout detection for long term volatility (center), and in-class specificity using the SAX-VSM process (right)
to be devoted specifically to a single task, and mixed-use buildings pose a specific challenge
to prediction.
2. Performance Class - Each building is assigned to a particular performance class according to whether its area-normalized consumption falls in the bottom, middle, or top 33% percentile within its principal building use-type class.
3. General Operation Strategy - Buildings that are controlled by the same entity, such as those on a university campus, often have similar schedules, operating parameters, and use patterns. This objective tests how distinct these differences are between different campuses.
6.1. Principal building use
The first scenario investigated is the characterization of primary building use type. The goal of this effort is to quantify what temporal behavior is most characteristic of a building being used for a certain purpose. For example, what makes the electrical consumption patterns of an office building unique as compared to other purposes such as a convenience store, airport, or laboratory? This objective is necessary to understand which buildings are the peers of a given building. Whatever category a building is assigned determines what benchmark is used to determine the performance level of the building. The EnergyStar Portfolio Manager is the most common benchmarking platform in the United States, and the first step in its evaluation is identifying the property type. There are 80 property types in Portfolio Manager and each one is devoted to a particular primary building use type. Twenty-one of those property types are available for submission to achieve a 1-100 ENERGYSTAR score in the United States.
Allocation of the primary use type of a building is often considered a trivial activity when working with a smaller set of buildings. As the number of buildings being analyzed grows, so does the complexity of space use evaluation. The use of buildings changes over time and these changes are not always documented. In several of the case studies, this topic was discussed and highlighted as an issue concerning benchmarking a building.
Discriminatory features have already been visualized extensively in previous sections and the differences between the primary use types are apparent in the overview heat maps of each feature. Figure 10 is the first such example of the output results of the classification model in predicting the building's primary use type using the temporal features created in this study. This visualization is a kind of error matrix, or confusion matrix, that illustrates the performance of a supervised classification algorithm. The y-axis represents the correct label of each classification input and the x-axis is the predicted label. An accurate classification would fall on the left-to-right diagonal of the grid. This grid is normalized according to the percentage of buildings within each class. The model was built using the scikit-learn Python library7 with the number of estimators set to 100 and the minimum samples per leaf set to 2. The overall general accuracy of the model is 67.8% as compared to a baseline model of 22.2%. The baseline model uses a stratified strategy in which categories are chosen randomly based on the percentage of each class occurring in the training set. Based on the analysis, university dormitories and primary/secondary classrooms are the best-characterized use types overall, with precisions of 92% and 96% and accuracies of 74% and 75%, respectively. The office category is easily confused with university classrooms and laboratories. This situation is not surprising as many of these facilities are quite similar and uses within these categories often overlap.
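The model setup and stratified baseline described above can be sketched as follows; the feature matrix here is synthetic stand-in data, not the paper's temporal feature set:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the temporal feature matrix:
# 507 buildings, 40 features, 5 principal use-type classes
X, y = make_classification(n_samples=507, n_features=40, n_informative=15,
                           n_classes=5, random_state=0)

# Random forest with the hyperparameters reported in the text
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=0)
# Stratified baseline: predict classes at random in proportion
# to their frequency in the training set
baseline = DummyClassifier(strategy="stratified", random_state=0)

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
base_acc = cross_val_score(baseline, X, y, cv=5).mean()
```

After fitting, the `feature_importances_` attribute of the forest supports the variable importance screening discussed throughout this section.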
Previously, an example of how to characterize building use type was illustrated using a random forest model and various feature importance techniques. In this subsection, a discussion is presented of how this sort of characterization can be useful in a practical setting. In the case study interviews, the topic of benchmarking of buildings was discussed. One of the issues presented to the operations teams was the concept of not having a complete understanding of the way the buildings on their campus were being used. For example, several of the campuses have a spreadsheet that outlines various metadata about the facilities on campus. This worksheet, in many cases, includes the primary use type of the building. It was found that this primary use type designation is often loosely based on information from when the building was constructed or on an informal site survey. In other situations, the building has an accurate breakdown of all the sub-spaces in the building and approximately how the spaces are being used. In these discussions, the idea was presented that building use type characterization could be used to determine automatically whether the labels within these spreadsheets aligned with the patterns of use characterized by the temporal feature extraction. This proposal was met with some positive feedback, albeit with hesitation to fully confirm that this process would be entirely necessary if labor could instead be directed to the same task.
Many of the case study subjects were then shown a series of graphics designed to tell the story of building use type characterization in an automated way. Figure 11 is the first graphic shown to the subjects. Each of the variables visualized in this figure has been scaled within
7http://scikit-learn.org/
Figure 10: Classification error matrix for prediction of building use type using a random forest model
their ranges, which causes the most extreme values to occur at the minimum and maximum of the y-axis. This figure illustrates several of the most easily understood temporal features and how they break down across the various building use types. This graphic was created using the data for a particular case study; therefore more separation between the classes exists than in the prediction of classes found in the previous section. Discussions using this graphic first centered around the first feature: Daily Magnitude per Area. It was intuitive to most participants that university laboratories have more and primary/secondary schools have less consumption per area than the other use types. It is more surprising, however, that certain building use types are characterized well by other features, such as the number of breakouts with primary/secondary schools and the daily and weekly specificity with university dormitories. Outlier buildings for each of the primary use types can be found for all of the variables; this occurrence is natural in the construction industry, and these have not been filtered.
Figure 11: Simplified breakdowns of general features according to building use type that were presented to case study
subjects
6.2. Characterization of building performance class
The second objective targeted in this study is the ability of temporal features to characterize whether a building performs well or not within its use-type class. Consumption is the metric being measured; therefore, the goal of this analysis is not to predict the performance of a building, but to determine which temporal characteristics are correlated with good or poor performance. This effort is related to the process of benchmarking buildings. Using the insight gained through characterization of building use type, it is possible to inform whether a building's behavior matches its peers. Once a building is part of a peer group, it is necessary to understand how well that building performs within that group. In this section, the case study buildings are divided according to the percentile each falls within for its in-class performance. The buildings are divided according to percentiles, with those in the lowest 33% classified as "Low," those in the 33rd to 66th percentile as "Intermediate," and the top 33% classified as "High." As in the previous section, these classifications and a subset of temporal features are implemented into a random forest model to understand how well the features characterize the different classes. Since this objective is related to consumption, all input features with known correlations to consumption were removed from the training set. These include the prominent features of consumption per area, but also many of the statistical metrics such as maximum and minimum values. Most of the daily ratio input features remain in the analysis as they are not directly correlated with total consumption. Figure 12 illustrates the results of the model in an error matrix. It can be seen that high and low consuming buildings are well characterized. The intermediate buildings have higher error rates and are often misclassified as one of the other two classes. The overall accuracy of the model for classification is 62.3% as compared to a baseline of 38%.
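The in-class tertile labeling can be sketched with pandas, computing percentile cuts within each use-type group rather than globally; the EUI values and column names below are illustrative, not drawn from the case study data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "use_type": rng.choice(["office", "lab", "dormitory"], size=300),
    "eui": rng.lognormal(mean=4.5, sigma=0.5, size=300),  # kWh/m2/yr, synthetic
})

# Tertiles are computed within each principal use-type class,
# not across the whole building stock
df["perf_class"] = df.groupby("use_type")["eui"].transform(
    lambda s: pd.qcut(s, 3, labels=["Low", "Intermediate", "High"]))
```

Grouping before cutting is what makes the label a within-peer-group ranking: a laboratory is only compared against other laboratories.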
In a situation similar to the discussion about building use type, participants in the case studies
were guided through the process of analysis using a subset of features from buildings on their
campus. Figure 13 illustrates a graphic that was shown to the groups. In this case, the buildings
Figure 12: Classification error matrix for prediction of performance class using a random forest model
are divided into two classes: Good and Bad. These categories are based on whether the building falls in the upper or lower 50% within its class. The first observation by the case study participants is that the load diversity, or the daily maximum versus minimum, is a reliable indicator of the performance class. This fact is not surprising as this metric indicates the magnitude of the base load consumption as compared to the peak. Other relatively strong differentiators, in this case, are cooling energy, seasonal changes, and weekly specificity. The discussions related to this graphic centered around the potential for the temporal features to inform why a building is performing well or not.
Figure 14 illustrates another graphic related to building consumption classes that was discussed with case study participants. This graphic is an overview of the distributions of the simplified set of features for a certain campus as compared to the entire set of case study buildings. This graphic shows where the buildings on this campus stand as compared to their peers. In this case, the buildings are on the higher end of the normalized consumption, likely because almost all of them are also in the top 20% of buildings for heating energy consumption.
Figure 13: Simplified breakdowns of general features according to performance level that were presented to case study
subjects
The buildings also have a relatively high load diversity, thus the base loads for this campus are
likely higher than average and interventions could be designed to reduce this unoccupied load.
Many of the case study participants saw this insight as useful, as it supplements the information from benchmarking.
Figure 14: Feature distributions of a single campus as compared to all other case study buildings
6.3. Characterization of operational strategies
The final characterization objective for the case studies is the ability of the temporal features to classify buildings from the same campus, and thus buildings that are being operated in similar ways. This characterization takes into account the similarity in occupancy schedules, patterns of use, and other factors related to how a building performs. Like the performance classes, this type of classification is more important for understanding the features that contribute to the differentiation than for the classification itself. Seven campuses were selected from the 507 buildings to create seven groups of buildings to characterize the differences between their operating behaviors. Features that are indicators of weather sensitivity were removed for this objective, as these relate to the location of the buildings, and thus, the campus on which they are located. Figure 15 illustrates the results from the random forest model trained on these data. The accuracy of this model is 80.5% as compared to a baseline of 16.9%. The model is excellent at predicting some of the groups, such as groups 1-4, while weaker on others, such as groups 5-7. The high accuracy of this prediction is surprising and attests to the ability of the temporal features and the random forest model to capture the operational norms of these buildings.
Figure 15: Classification error matrix for prediction of operations group type using a random forest model
7. Conclusion
This paper was undertaken with objectives related to the characterization of building behavior using temporal feature extraction and variable importance screening. The primary goal of the effort is to automate the process of predicting various types of meta-data.
A framework of analysis was developed to address and test this effort. This process was implemented on two sets of case study buildings and the key quantitative conclusions include:
• The framework can characterize primary building use type with a general accuracy of 67.8% as compared to a baseline model of 22.2% based on five use type classes. Temporal features enable a three-fold increase in building use prediction. Pattern-based features are the most common category in the top ten in the characterization of use-type, and thus are important differentiators as compared to more traditional features. Features from the STL decomposition process were found to be important as well due to their ability to distinguish differences in normalized weekly patterns.
• For building performance class, the overall classification accuracy of the model is 62.3% as compared to a baseline of 38%. The top indicator of high versus low in-class building performance was the pattern specificity temporal feature. Once again, pattern-based temporal features were found to be significant in distinguishing between different types of behavior.
• For operations class, the accuracy of the model is 80.5% as compared to a baseline of 16.9%, a four-fold increase. Daily scheduling of buildings was captured using the DayFilter features, accounting for half of the input feature set.
7.1. Open data and reproducible research
The source code and analytics workflow of this paper can be found in a series of Jupyter
notebooks found in a GitHub repository8. The data set that is utilized for the analysis is the
Building Data Genome Project9. These analysis files can be downloaded, and much of the work
replicated.
8. Acknowledgements
The authors would like to thank all of the building operations and maintenance professionals from around the world that assisted in the gathering of the data utilized. This study was funded by a Fellowship from the Institute of Technology in Architecture (ITA) at the ETH Zürich.
9. References
[1] E. Mills, Building commissioning: a golden opportunity for reducing energy costs and greenhouse gas emissions
in the United States, Energy Efficiency 4 (2011) 145–173.
[2] M. Enger, H. Friedman, D. Moser, Building Performance Tracking in Large Commercial Buildings: Tools and
Strategies - Subtask 4.2 Research Report: Investigate Energy Performance Tracking Strategies in the Market, Tech-
nical Report, 2010.
8https://github.com/buds-lab/temporal-features-for-nonres-buildings-library
9http://www.buildingdatagenome.org/
[3] J. Ulickey, T. Fackler, E. Koeppel, J. Soper, Building Performance Tracking in Large Commercial Buildings:
Tools and Strategies - Subtask 4.3 Characterization of Fault Detection and Diagnostic (FDD) and Advanced Energy
Information System (EIS) Tools, Technical Report, 2010.
[4] E. Greensfelder, H. Friedman, E. Crowe, Building Performance Tracking in Large Commercial Buildings: Tools
and Strategies - Subtask 4.4 Research Report: Characterization of Building Performance Metrics Tracking Method-
ologies, Technical Report, 2010.
[5] The White House, FACT SHEET: Cities, Utilities, and Businesses Commit to Unlocking Access to Energy Data
for Building Owners and Improving Energy Efficiency (2016).
[6] C. Miller, Z. Nagy, A. Schlueter, A review of unsupervised statistical learning and visual analytics techniques
applied to performance analysis of non-residential buildings, Renewable and Sustainable Energy Reviews (2017).
[7] A. Albert, M. Maasoumy, Predictive segmentation of energy consumers, Applied Energy 177 (2016) 435–448.
[8] A. Albert, R. Rajagopal, R. Sevlian, Segmenting Consumers Using Smart Meter Data, in: Proceedings of the
Third ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, BuildSys '11, ACM,
New York, NY, USA, 2011, pp. 49–50.
[9] S. Borgeson, Targeted Efficiency: Using Customer Meter Data to Improve Efficiency Program Outcomes, PhD,
University of California, Berkeley, Berkeley, CA, USA, 2013.
[10] J. Kwac, R. Rajagopal, Demand response targeting using big data analytics, in: Big Data, 2013 IEEE International
Conference on, IEEE, 2013, pp. 683–690.
[11] T. Rasanen, M. Kolehmainen, Feature-based clustering for electricity use time series data, in: Adaptive and
Natural Computing Algorithms: 9th International Conference, ICANNGA 2009, Springer, Kuopio, Finland, 2009,
pp. 401–412.
[12] F. Iglesias, W. Kastner, Analysis of Similarity Measures in Time Series Clustering for the Discovery of Building
Energy Patterns, Energies 6 (2013) 579–597.
[13] S. Petcharat, S. Chungpaibulpatana, P. Rakkwamsuk, Assessment of potential energy saving using cluster analysis:
A case study of lighting systems in buildings, Energy and Buildings 52 (2012) 145–152.
[14] A. Lavin, D. Klabjan, Clustering time-series energy data from smart meters, Energy Efficiency 8 (2014) 681–689.
[15] G. Chicco, I. S. Ilie, Support vector clustering of electrical load pattern data, IEEE Transactions on Power Systems
24 (2009) 1619–1628.
[16] S. M. Bidoki, N. Mahmoudi-Kohan, M. H. Sadreddini, M. Zolghadri Jahromi, M. P. Moghaddam, Evaluating
different clustering techniques for electricity customer classification, in: Transmission and Distribution Conference
and Exposition, 2010 IEEE PES, IEEE, New Orleans, LA, USA, 2010, pp. 1–5.
[17] I. Panapakidis, M. Alexiadis, G. Papagiannis, Evaluation of the performance of clustering algorithms for a high
voltage industrial consumer, Engineering Applications of Artificial Intelligence 38 (2015) 1–13.
[18] S. P. Pieri, I. Tzouvadakis, M. Santamouris, Identifying energy consumption patterns in the Attica hotel sector
using cluster analysis techniques with the aim of reducing hotels CO2 footprint, Energy and Buildings 94 (2015)
252–262.
[19] H. X. Zhao, F. Magoules, A review on the prediction of building energy consumption, Renewable and Sustainable
Energy Reviews 16 (2012) 3586–3592.
[20] S. V. Verdu, M. O. Garcia, C. Senabre, A. G. Marin, F. J. G. Franco, Classification, Filtering, and Identification
of Electrical Customer Load Patterns Through the Use of Self-Organizing Maps, IEEE Transactions on Power
Systems 21 (2006) 1672–1682.
[21] A. R. Florita, L. J. Brackney, T. P. Otanicar, J. Robertson, Classification of Commercial Building Electrical Demand
Profiles for Energy Storage Applications, in: Proceedings of ASME 2012 6th International Conference on Energy
Sustainability & 10th Fuel Cell Science, Engineering and Technology Conference (ESFuelCell2012), San Diego,
CA, USA.
[22] T. G. Nikolaou, D. S. Kolokotsa, G. S. Stavrakakis, I. D. Skias, On the Application of Clustering Techniques
for Office Buildings' Energy and Thermal Comfort Classification, IEEE Transactions on Smart Grid 3 (2012)
2196–2210.
[23] J. Ploennigs, B. Chen, P. Palmes, R. Lloyd, e2-Diagnoser: A System for Monitoring, Forecasting and Diagnosing
Energy Usage, in: Data Mining Workshop (ICDMW), 2014 IEEE International Conference on, IEEE, Shenzhen,
China, 2014, pp. 1231–1234.
[24] A. Shahzadeh, A. Khosravi, S. Nahavandi, Improving load forecast accuracy by clustering consumers using smart
meter data, in: Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney,
Ireland, pp. 1–7.
[25] A. Reinhardt, S. Koessler, PowerSAX: Fast motif matching in distributed power meter data using symbolic repre-
sentations, in: Proceedings of 9th IEEE International Workshop on Practical Issues in Building Sensor Network
Applications (SenseApp 2014), IEEE, Edmonton, Canada, 2014, pp. 531–538.
[26] Z. Liu, H. Li, K. Liu, H. Yu, K. Cheng, Design of high-performance water-in-glass evacuated tube solar water
heaters by a high-throughput screening based on machine learning: A combined modeling and experimental study,
Solar Energy 142 (2017) 61–67.
[27] Y. Chen, T. Hong, M. A. Piette, City-Scale Building Retrofit Analysis: A Case Study using CityBES, in: Building
Simulation 2017, San Francisco, CA, USA.
[28] C. Miller, Screening Meter Data: Characterization of Temporal Energy Data from Large Groups of Non-Residential
Buildings, Ph.D. thesis, ETH Zürich, Zürich, Switzerland, 2017.
[29] C. Miller, F. Meggers, The Building Data Genome Project: An open, public data set from non-residential building
electrical meters, Energy Procedia 122 (2017) 439–444.
[30] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[31] T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction,
Springer series in statistics, Springer, New York, NY, 2nd edition, 2009.
[32] G. Louppe, L. Wehenkel, A. Sutera, P. Geurts, Understanding variable importances in forests of randomized trees,
in: Advances in Neural Information Processing Systems, pp. 431–439.
[33] S. Borgeson, J. Kwac, visdom: R package for energy data analytics, 2015. R package version 0.9.
[34] T. Mitsa, Temporal Data Mining, Chapman and Hall/CRC, 2010.
[35] C. Miller, A. Schlueter, Forensically discovering simulation feedback knowledge from a campus energy informa-
tion system, in: Proceedings of the 2015 Symposium on Simulation for Architecture and Urban Design (SimAUD
2015), SCS, Washington DC, USA, 2015, pp. 33–40.
[36] M. F. Fels, PRISM: an introduction, Energy and Buildings 9 (1986) 5–18.
[37] J. W. Taylor, L. M. De Menezes, P. E. McSharry, A comparison of univariate methods for forecasting electricity
demand up to a day ahead, International Journal of Forecasting 22 (2006) 1–16.
[38] P. Price, Methods for Analyzing Electric Load Shape and its Variability, Lawrence Berkeley National Laboratory
(2010).
[39] J. L. Mathieu, P. N. Price, S. Kiliccote, M. A. Piette, Quantifying changes in building electricity use, with
application to demand response, IEEE Transactions on Smart Grid 2 (2011) 507–518.
[40] J. K. Kissock, C. Eger, Measuring industrial energy savings, Applied Energy 85 (2008) 347–361.
[41] R. B. Cleveland, W. S. Cleveland, J. E. McRae, I. Terpenning, STL: A seasonal-trend decomposition procedure
based on loess, Journal of Official Statistics 6 (1990) 3–73.
[42] P. Patel, E. Keogh, J. Lin, S. Lonardi, Mining motifs in massive time series databases, in: Proceedings of 2002
IEEE International Conference on Data Mining (ICDM 2002), Maebashi City, Japan, pp. 370–377.
[43] E. J. Keogh, J. Lin, A. Fu, HOT SAX: Efficiently finding the most unusual time series subsequence, in: Proceedings
of the Fifth IEEE International Conference on Data Mining (ICDM '05), IEEE, Houston, TX, USA, 2005.
[44] C. Miller, Z. Nagy, A. Schlueter, Automated daily pattern filtering of measured building performance data, Au-
tomation in Construction 49, Part A (2015) 1–17.
[45] J. Lin, E. J. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with implications for stream-
ing algorithms, in: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery (DMKD ’03), ACM, San Diego, CA, USA, 2003, pp. 2–11.
[46] P. Senin, S. Malinchik, SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model,
in: 2013 IEEE 13th International Conference on Data Mining, Institute of Electrical & Electronics Engineers
(IEEE), 2013.
[47] N. A. James, A. Kejariwal, D. S. Matteson, Leveraging Cloud Data to Mitigate User Experience from Breaking
Bad, arXiv preprint arXiv:1411.7955 (2014).