Mining electrical meter data to predict principal building use,
performance class, and operations strategy for hundreds of
non-residential buildings
Clayton Miller^{a,b,*}, Forrest Meggers^{c}
^a Building and Urban Data Science (BUDS) Group, Department of Building, National University of Singapore, 117566 Singapore
^b Institute of Technology in Architecture (ITA), Architecture and Building Systems (A/S), ETH Zürich, 8093 Zürich, Switzerland
^c Cooling and Heating for Architecturally Optimized Systems (CHAOS) Lab, Andlinger Center for Energy and Environment, Dept. of Architecture, Princeton University, Princeton, NJ, 08544, USA
Abstract
This study focuses on the inference of characteristic data from a data set of 507 non-residential
buildings. A two-step framework is presented that extracts statistical, model-based, and pattern-
based behavior. The goal of the framework is to reduce the expert intervention needed to utilize
measured raw data in order to infer information such as building use type, performance class,
and operational behavior. The first step is temporal feature extraction, which utilizes a library
of data mining techniques to filter various phenomena from the raw data. This step transforms
quantitative raw data into qualitative categories that are presented in heat map visualizations for
interpretation. In the second step, a random forest classification model is tested for accuracy in
predicting primary space use, magnitude of energy consumption, and type of operational strategy
using the generated features. The results show that predictions with these methods are 45.6%
more accurate for primary building use type, 24.3% more accurate for performance class, and
63.6% more accurate for building operations type as compared to baselines.
Keywords: Data mining, Building performance, Performance classification, Energy efficiency,
Smart meters
1. Introduction
The built and urban environments have a significant impact on resource consumption and
greenhouse gas emissions in the world. The United States is the world’s second-largest en-
ergy consumer, and buildings there account for 41% of energy consumed^1. The most extensive
meta-analysis thus far of non-residential existing buildings showed a median opportunity of 16%
energy savings potential by using cost-effective measures to remedy performance deficiencies
[1]. Simply stated, roughly 6% of the energy consumed in the U.S. could be easily mitigated - a
* Corresponding author. Phone: +65 81602452
Email address: clayton@nus.edu.sg (Clayton Miller)
^1 As of 2014, according to http://www.eia.gov/
Preprint submitted to Energy and Buildings October 6, 2017
figure that would eventually grow to an annual energy savings potential of $30 billion and 340
megatons of CO2 by the year 2030. Beyond saving energy and money and mitigating carbon, the
impact of building performance improvement also extends to the health, comfort and satisfaction
of the people who use buildings.
It is puzzling that these performance improvements are not being rapidly identified and implemented on a massive scale across the world's building stock, given the incentives and the amount of research focused on building optimization in the fields of Architecture, Engineering, and Computer Science. A comprehensive study of building performance analysis was completed by the
California Commissioning Collaborative (CACx) to characterize the technology, market, and
research landscape in the United States. Three of the key tasks in this project focused on estab-
lishing the state of the art [2], characterizing available tools and the barriers to adoption [3], and
developing standard performance metrics [4]. These reports were accomplished through investi-
gation of the available tools and technologies on the market as well as discussions and surveys
with building operators and engineers. The common theme amongst the interviews and case
studies was the lack of time and expertise on the part of the dedicated operations professionals.
The findings showed that installation time and cost was driven by the need for an engineer to de-
velop a full understanding of the building and systems. These barriers reduce the implementation
of performance improvements.
From these studies, it becomes apparent that the biggest barrier to achieving performance
improvement in buildings is scalability. Architecture is a discipline founded with aesthetic cre-
ativity as a core tenet. Frank Lloyd Wright once stated, “The mother art is architecture. Without
an architecture of our own, we have no soul of our civilization.” Designers rightfully strive for
artistic and meaningful creations; this phenomenon results in buildings with not only distinctive
aesthetics but also unique energy systems design, installation practices and different levels of or-
ganization within the data-creating components. This paper shows that an emerging mass of data
from the built environment can facilitate better characterization of buildings through automation
of meta-data extraction. These data are temporal sensor measurements from performance
measurement systems.
1.1. Growth of Raw Temporal Data Sources in the Built Environment
As entities of analysis, buildings are less like typical mass-produced manufactured devices, in which each unit is identical in its components and functionality, and more like the customers of a business: entities that are similar and yet have many nuances. Conventional
mechanistic or model-based approaches, typically borrowed from manufacturing, have been the
status quo in building performance research. As previously discussed, scalability among the
heterogeneous building stock is a significant barrier to these approaches. More appropriate means
of analysis lies in statistical learning techniques more often found in the medical, pharmaceutical
and customer acquisition domains. These methods rely on extracting information and correlating
patterns from large empirical data sets. The strength of these techniques is in their robustness and
automation of implementation - concepts explicitly necessary to meet the challenges outlined.
This type of research on buildings would have been difficult even a few years ago. The creation
and consolidation of measured sensor sources from the built environment and its occupants is oc-
curring on an unprecedented scale. The Green Button Ecosystem now enables the easy extraction
of performance data from over 60 million buildings^2. Advanced metering infrastructure (AMI),
^2 According to http://www.greenbuttondata.org/
or smart meters, have been installed on over 58.5 million buildings in the US alone^3. A recent
press release from the White House summarizes the impact of utilities and cities in unlocking
these data [5]. It announces that 18 power utilities, serving more than 2.6 million customers, will
provide detailed energy data by 2017. The release also suggests that such accessibility will enable
improvement of energy performance in buildings by 20% by 2020. A vast majority of these raw
data being generated are sub-hourly temporal data from meters and sensors.
1.2. Previous work
A significant amount of work has been undertaken in the field of building characterization
using measured meter data. A comprehensive review of unsupervised learning techniques for
various portfolio analysis and smart meter data was recently completed that includes much of
the previous work in this area [6]. The key studies in the field of building characterization often
deal with segmentation of large numbers of buildings, usually within the realm of smart meter
analytics. Customer segmentation has been studied using various extracted temporal features
from smart meter data for targeting programs [7, 8, 9, 10]. Feature-based clustering of time-series performance data from buildings is another key field that precedes the current work. This field seeks to group various types of buildings or meters into similar clusters for analysis [11,
12, 13, 14, 15, 16, 17, 18]. Various studies have looked at classification of buildings with various
objectives using temporal meter data as a source of features [19, 20, 21, 16, 22]. Several other
studies have extracted temporal features that enhance the ability to forecast consumption [23, 24,
25]. Several studies have analyzed larger than usual datasets from devices such as water heaters
[26] and retrofit analysis at the city scale [27].
1.3. A Framework for Automated Characterization of Large Numbers of Non-Residential Buildings
This paper discusses a framework to investigate which characteristics of whole building elec-
trical meter data are most indicative of various meta-data about buildings among large collections
of commercial buildings. This structure is designed to screen electrical meter data for insight on
the path towards deeper data analysis. The screening nature of the process is motivated by the
scalability challenges previously outlined. An initial component of the methodology was a se-
ries of case study interviews and data collection processes to survey field data from numerous
buildings around the world. A significant portion of this work was completed as part of a Ph.D.
dissertation entitled "Screening Meter Data: Characterization of Temporal Energy Data from
Large Groups of Non-Residential Buildings" [28].
The contributions of this study are related to its development and testing of a library of tem-
poral machine learning features within the domain of non-residential buildings. To the authors'
best knowledge, no previous study has taken such a large number of buildings (507) and applied
temporal feature engineering approaches from such a wide range of sources. Temporal features
are extracted using techniques such as Seasonal Decomposition of Time Series by Loess (STL)
and Symbolic Aggregate approXimation (SAX) using Vector Space Models (VSM) that have
never been applied to electrical meter data from buildings. This study is also unique in that the
objective is prediction of meta-data about buildings. This target is related to the contemporary
challenge of large, raw temporal datasets from thousands of buildings with a significant amount
of missing information; such is the case with large campuses, portfolios and utility-scale smart
meter implementations.
^3 As of 2014, according to http://www.eia.gov/tools/faqs/faq.cfm?id=108&t=3
2. Methodology
A two-step process is presented as a means of extracting knowledge from whole building
electrical meters. Figure 1 illustrates the intermediate steps in each of the phases. The first
step is to create temporal features that produce quantitative data to describe various phenomena
occurring in the raw temporal data. This action is intended to transform the data into a more
human-interpretable format and visualize the general patterns in the data. In this step, the data
are extracted, cleaned, and processed with a library of temporal feature extraction techniques
to differentiate various types of behavior. These features are visualized using an aggregate heat
map format that can be evaluated according to expert intuition, comparison with design intent
metrics, or with outlier detection.
The second step is focused on the characterization of buildings using the temporal features
according to several objectives. This process allows an analyst to understand the impact each
feature has upon the discrimination of each objective. Three test objectives are implemented in
this study: principal building use, performance class, and operations strategy. One of the key
outputs of this supervised learning process is the detection and discussion of what input features
are most important in predicting the various classes. This approach gives exploratory insight
into what features are important in determining various characteristics of a particular building
amongst a large set of its peers. These metadata are building blocks for many other techniques
such as benchmarking, diagnostics and targeting. The motivation for choosing these particular
objectives centers around the consistently available meta-data from the collected case study data
and their relation to various other techniques in the building performance analysis domain.
Figure 1: Overview of data analytics framework
2.1. Case study buildings and collected data
An open data set of 507 whole-building electrical meters was utilized in this study for implementation of the two-step process. These buildings are from university campuses from around
the world. The origin and development of these data are found in Miller and Meggers [29].
A broad range of descriptive statistics and meta-data explanation are available in the previous
literature.
2.2. Temporal feature extraction
Feature extraction is an essential process of machine learning and is the means by which objects are described quantitatively in a way that algorithms can differentiate between different types or classes. Many of these data are needed when creating an energy simulation model, when setting thresholds for automated fault detection and diagnostics, or when benchmarking a building. When performing analysis on a single building, this meta-data might be easy to accumulate. However, when such a process is scaled across hundreds or potentially thousands of buildings, collection of these data is not a trivial procedure.
The goal of temporal feature extraction and analysis is to use various techniques to convert
all these qualitative terms into a quantitative domain. For example, the descriptor weather-dependency can be quantified through the Spearman rank order correlation coefficient with outdoor air temperature. Consistency or volatility of daily, weekly, or annual
behavior can be quantified using various pattern recognition techniques. The primary focus of
this study is to create and apply some temporal feature extraction techniques on commercial
buildings for characterization. Figure 2 illustrates a hierarchy of the conventional categories of
temporal features and the new category of temporal features that include a few examples that are
outlined in this study.
Figure 2: Temporal features extracted solely from raw sensor data
Temporal features are aggregations of the behavior exhibited in time-series data. They are
characteristics that summarize sensor information in a way to inform an analyst through visu-
alization or to use as training data in a predictive classification or regression model. Feature
extraction is a step in the process of machine learning and is a form of dimensionality reduction
of data. This process seeks to quantify various qualitative behaviors. This section provides an
overview of the categories of temporal features extracted from the case study building data, the methods used to implement them, and visualized examples of how a selected subset of features manifests over a time range. Table 1 gives an overview of the temporal features outlined in
this section.
Feature Category         General Description
Statistics-based         Aggregations of time series data using mean, median, max, min, and standard deviation
Regression model-based   Development of a predictive model using training data, then using the model parameters and outputs to describe the data
Pattern-based            Extraction of frequent and useful daily, weekly, monthly, or long-term patterns

Table 1: Overview of feature categories
2.3. Characterization and prediction of meta-data
The primary goal of this research is to get a better sense of what behavior in time-series sensor
data is most characteristic of various types of buildings. As mentioned in the introduction, if
this meta-data can be discriminated, the process of characterizing a building can be automated.
This section describes the use of random forest classification models and their input variable
importance feature. An overview of this process is found in Figure 3.
For each objective, two steps are taken: first to predict the objective and then to investigate the influence of the input features on class differentiation. In the first step, a random forest classification model is built using subsets of the generated features to predict the objective's class. In the second phase, the classification model's accuracy indicates the ability of the temporal features to describe the class.
Figure 3: Characterization process to investigate the ability of various features to describe the classification objectives
Random forest classification models were chosen based on their capacity to model diverse
and large data sets in a robust way [30]. These models use an ensemble of decision trees to
predict various characteristic labels about each building based on its features. The literature
describes decision trees as the "closest to meeting the requirements for serving as an off-the-shelf
procedure for data mining" [31]. Decision trees often over-fit data due to high variance. Random
forest models work by creating a set of decision trees and averaging all of their predictions to
overcome this variance.
Random forests use a form of cross-validation by training and testing each tree using a dier-
ent bootstrapped sample from the data. This process produces an out-of-bag error (OOB) that
acts as a generalized error for understanding how well each class can be predicted. This accuracy
is used to determine how well the generated temporal features can delineate the class objectives.
Random forests can also calculate the importance of the input features and how well they lend themselves to predicting the targets. This attribute is useful in that it allows us to understand exactly which temporal features are most characteristic of various objectives. Variable importance is calculated using Equation 1: the importance of an input feature X_m for predicting Y is found by adding up the weighted impurity decreases p(t)Δi(s_t, t) for all nodes t where X_m is used, averaged over all N_T trees in the forest [32].

Imp(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t)\,\Delta i(s_t, t) \qquad (1)
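As a minimal sketch, scikit-learn's RandomForestClassifier exposes both the OOB accuracy and the impurity-based importances of Equation 1 directly; the synthetic features and two-class labels below are illustrative placeholders, not the study's data:

```python
# Sketch: random forest with out-of-bag (OOB) accuracy and mean-decrease-in-
# impurity importances (scikit-learn's feature_importances_ implements the
# tree-averaged importance of Equation 1). All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))                 # e.g. 10 temporal features
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # e.g. a two-class objective

model = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,    # generalization estimate from the bootstrap hold-outs
    random_state=0,
).fit(X, y)

print(f"OOB accuracy: {model.oob_score_:.2f}")
ranked = np.argsort(model.feature_importances_)[::-1]
print("most important features:", ranked[:3])
```

The OOB score stands in for the generalized error described above, and the importance ranking identifies which temporal features discriminate the classes.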
3. Statistics-based features
Statistics-based temporal features are the first and most simplified category of temporal fea-
tures developed. The main classes of features are basic temporal statistics, ratio-based features,
and the Spearman rank order correlation coefficient.
3.1. Basic statistics
The first set of temporal features to be extracted are basic statistics-based metrics that utilize
the time-series data vector for various time ranges to obtain information using mean, median,
maximum, minimum, range, variance, and standard deviation. Many of these features are devel-
oped through the implementation of the VISDOM package in the R programming language [33].
As a simple example, if a time-series vector is described as X, with N values X = x_1, x_2, ..., x_N, the most common statistical metric, the mean (μ), can be calculated using Equation 2 [34].

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (2)
The mean is taken not just for the entire time series, but also from the summer and winter
seasons. The variance of the values is taken for the whole year, the summer and winter seasons as
well. The variance of daily mean, minimum, and maximum values are determined to understand
the breadth of values across the time range. Variance is calculated according to Equation 3.
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \qquad (3)
The maximum and minimum electrical demand are calculated. Additionally, the hour and date
at which the maximum demand occurs are determined to understand when peak consumption
occurs. The 97th and 3rd percentiles are calculated to exclude any extreme outliers, values
that are often more useful than the maximum and minimum.
A series of hour-of-day (HOD) metrics are calculated that aggregate the behavior occurring at each hour of the day. The first of these captures the most common hour of peak demand on the 10% hottest days and the most common hour of the top 10% of temperatures, which informs roughly about cooling energy consumption. These metrics are repeated for the bottom 10% coldest days and temperatures. Another set of twenty-four parameters is calculated to account directly for the mean demand of each hour of the day.
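These statistics can be sketched in a few lines of pandas; the synthetic daytime-peaking meter series below is an assumption of the example, not a case study building:

```python
# Sketch: hour-of-day (HOD) mean-demand features and robust percentile
# statistics with pandas, computed on a synthetic hourly meter series.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
rng = np.random.default_rng(0)
office_hours = (idx.hour >= 8) & (idx.hour < 18)
meter = pd.Series(50 + 30 * office_hours + rng.normal(0, 2, len(idx)), index=idx)

hod_mean = meter.groupby(meter.index.hour).mean()   # the 24 HOD parameters
peak_hour = int(meter.idxmax().hour)                # hour of the annual peak
p97, p03 = meter.quantile([0.97, 0.03])             # outlier-robust max/min
print(hod_mean.round(1).head(3))
```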
3.2. Ratio-based statistical features
The second major category of statistical features is ratio-based features. Simply, these are
metrics in which two or more of the previously calculated statistical parameters are combined
as a ratio. These features often have a normalizing effect in which buildings can be more appropriately compared to each other. The first extracted metric of this type is one of the most
commonly calculated for building performance analysis: the consumption magnitude of elec-
tricity normalized by the floor area of the building. This metric seeks to provide a basis for
comparison between buildings and is used as a key metric within numerous benchmarking and
performance analysis techniques.
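A minimal sketch of two such ratio features follows; the synthetic meter series and the floor area are made-up illustrative values:

```python
# Sketch: two ratio-based features -- area-normalized annual consumption and
# the mean daily max-to-min load ratio -- on a synthetic hourly meter.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=24 * 365, freq="h")
rng = np.random.default_rng(5)
daytime = (idx.hour >= 8) & (idx.hour < 18)
meter = pd.Series(80 + 40 * daytime + rng.normal(0, 3, len(idx)), index=idx)  # kW

floor_area_m2 = 5_000                               # assumed gross floor area
eui = meter.sum() / floor_area_m2                   # kWh per m^2 per year
daily = meter.resample("D")
max_min_ratio = float((daily.max() / daily.min()).mean())
print(f"EUI: {eui:.0f} kWh/m2/yr, daily max/min ratio: {max_min_ratio:.2f}")
```

Dividing by floor area puts buildings of different sizes on a common scale, while the daily max/min ratio summarizes how strongly the load modulates each day.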
3.3. Spearman rank order correlation coefficient
Another useful metric relates to how much influence outside air temperature has on the consumption of a building. Miller et al. describe a process of utilizing the Spearman Rank Order Correlation (ROC) coefficient to approximate the correlation between outside conditions and electrical consumption [35]. The ROC essentially ranks the items in two different lists, and the coefficient quantifies whether those lists are correlated positively or negatively. In this case, the two variables are outside air temperature and electrical consumption. The coefficient ranges from -1 (highly negatively correlated) to +1 (highly positively correlated). If the correlation is positive, the ROC is closer to +1 and the electrical consumption is cooling sensitive, as consumption goes up with higher temperature. If the correlation is negative, the ROC is closer to -1, and the time range is heating sensitive, as consumption goes up with lower temperature.
The correlation coefficient can be visualized for a single case as seen in Figure 4. The factor, in this instance, is calculated individually for each month. This process results in twelve calculations of the metric, each using 29 to 31 samples. In this case, consumption from January to May is noticeably more heating sensitive, a fact that can be observed clearly from the line chart as well as the one-dimensional heat map. May to November is more cooling sensitive. It is interesting that September appears to be the most cooling sensitive month, a fact perhaps related to the schedules in use during that month. This coefficient is not a perfect indicator of HVAC consumption; it only detects a correlation. However, it is fast and easy to calculate and is the first phase of detecting weather dependency.
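The monthly calculation can be sketched with scipy; the synthetic cooling-driven building below is a placeholder for a real meter and weather feed:

```python
# Sketch: monthly Spearman rank-order correlation between outdoor temperature
# and electrical load. Positive values suggest cooling sensitivity, negative
# values heating sensitivity. All data are synthetic.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

idx = pd.date_range("2015-01-01", periods=365, freq="D")
rng = np.random.default_rng(1)
doy = idx.dayofyear.to_numpy()
temp = 10 + 12 * np.sin(2 * np.pi * (doy - 100) / 365) + rng.normal(0, 2, 365)
load = 500 + 8 * np.clip(temp - 15, 0, None) + rng.normal(0, 10, 365)

df = pd.DataFrame({"temp": temp, "load": load}, index=idx)
# One coefficient per month, each from that month's ~30 daily samples
monthly_roc = df.groupby(df.index.month).apply(
    lambda m: spearmanr(m["temp"], m["load"])[0]
)
print(monthly_roc.round(2))
```

For this synthetic building, summer months come out strongly cooling sensitive while winter months hover near zero, mirroring the single-building example above.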
3.4. Implementation of stats-based features
Figure 5 illustrates three examples of the screening parameters applied to all of the case study buildings: area-normalized consumption, the ratio-based daily load max vs. min, and the monthly Spearman rank order correlation coefficient. There are five segments of buildings based on the primary use types within the set: offices, university laboratories, college classrooms, primary/secondary schools, and university dormitories. These
metrics are visualized in this way to understand the difference between each of these use types.
Each row of the heat map for each segment shows the values of
Figure 4: Single building example of the Spearman rank order correlation coefficient with weather
the feature for a single building, while the x-axis is the time range for all buildings. Not all of
the case study buildings have a January to December time range. For these cases, the data were rearranged so that a continuous January to December series is available to visualize in the heat map. The aggregation metrics themselves are not calculated with this rearranged vector; it is used only for visualization purposes.
4. Regression model-based features
Semi-physical behavior of a building can be extracted by using performance prediction models and taking their output parameters and goodness-of-fit metrics for characterization. This section covers the use of several common electrical consumption prediction models to create sets of
temporal features useful for characterization of buildings.
Prediction of electrical loads based on their shape and trends over time is a mature field de-
veloped to forecast consumption, detect anomalies, and analyze the impact of demand response
and efficiency measures. The most common technique in this category is the use of heating
and cooling degree days to normalize monthly consumption [36]. Over the years, various other
methods have been developed using techniques such as neural networks, ARIMA models, and
more complex regression [37]. However, simplified methods have retained their usefulness over
time due to ease of implementation and accuracy. In the context of temporal feature creation,
a regression model provides various metrics that describe how well a meter conforms to con-
ventional assumptions. For example, if actual measurements and predicted consumption match
well, the underlying behavior of energy-consuming systems in the building has been captured
adequately. If not, there is an uncharacteristic phenomenon that will need to be captured with a
different type of model or feature.
Figure 5: Heat map of a selection of statistics-based temporal features: area-normalized consumption (left), ratio-based daily load max vs. min (center), and monthly Spearman rank order correlation coefficient (right)
4.1. Load shape regression-based features
A contemporary, simplified load prediction technique is selected to create temporal features
that capture whether the electrical measurement is simply a function of time-of-week scheduling.
This model was developed by Mathieu et al. and by Price and implemented mostly in the context of electrical demand response evaluation [38, 39]. The premise of the model is based on two features: a time-of-week indicator and an outdoor air temperature dependence. This model is also known as the Time-of-Week and Temperature (TOWT) model or LBNL regression model and is implemented in the eetd-loadshape library developed by Lawrence Berkeley National Laboratory^4.
According to the literature, the model operates as follows [38]. The time of week indicator
is created by dividing each week into a set of intervals corresponding to each hour of the week.
For example, the first interval is Sunday at 01:00, the second is Sunday at 02:00, and so on. The
last, or 168th, interval is Saturday at 23:00. A different regression coefficient, α_i, is calculated for each interval in addition to the temperature dependence. The model uses outdoor air temperature dependence to divide the intervals into two categories: one for occupied hours and one for unoccupied hours. These modes are not necessarily indicators of exactly when people are inhabiting the building, but merely an empirical indication of when occupancy-related systems are detected to be operating. Separate piecewise-continuous temperature dependencies are then calculated for
^4 https://bitbucket.org/berkeleylab/eetd-loadshape
each type of mode. The outdoor air temperature is divided into six equally sized temperature intervals. A temperature parameter, β_j, with j = 1...6, is assigned to each interval. Within the model, the outdoor air temperature at time t, occurring at time-of-week i (designated as T(t_i)), is divided into six component temperatures, T_{c,j}(t_i). Each of these temperatures is multiplied by β_j and then summed to determine the temperature-dependent load. For occupied periods the building load, L_o, is calculated by Equation 4.

L_o(t_i, T(t_i)) = \alpha_i + \sum_{j=1}^{6} \beta_j T_{c,j}(t_i) \qquad (4)
Prediction in unoccupied mode uses a single temperature parameter, β_u. The unoccupied load, L_u, is calculated with Equation 5.

L_u(t_i, T(t_i)) = \alpha_i + \beta_u T(t_i) \qquad (5)
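A simplified TOWT-style fit can be sketched with ordinary least squares: 168 time-of-week dummies stand in for the α_i terms and six binned component temperatures for the β_j terms. This is a stand-in for, not a reproduction of, the eetd-loadshape implementation; the occupied/unoccupied split is omitted and all data are synthetic:

```python
# Sketch of a TOWT-style regression: one dummy per hour-of-week (alpha_i)
# plus six component temperatures (beta_j), fitted with least squares.
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=24 * 7 * 8, freq="h")  # 8 weeks
rng = np.random.default_rng(2)
hour, dow = idx.hour.to_numpy(), idx.dayofweek.to_numpy()
temp = 15 + 10 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 1, len(idx))
occ = (hour >= 8) & (hour < 18) & (dow < 5)
load = 40 + 60 * occ + 2.0 * np.clip(temp - 18, 0, None) + rng.normal(0, 3, len(idx))

tow = dow * 24 + hour                           # time-of-week index, 0..167
X_tow = np.eye(168)[tow]                        # one dummy column per interval
edges = np.linspace(temp.min(), temp.max(), 7)  # six equal temperature bins
# Component temperatures T_{c,j}: how far T reaches into each bin
X_temp = np.clip(temp[:, None] - edges[:-1][None, :], 0, np.diff(edges)[None, :])

X = np.hstack([X_tow, X_temp])
coef, *_ = np.linalg.lstsq(X, load, rcond=None)
pred = X @ coef
resid = (load - pred) / load.max()              # normalized residual (cf. Eq. 6)
rmse = float(np.sqrt(np.mean((load - pred) ** 2)))
print(f"in-sample RMSE: {rmse:.2f} kW")
```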
The primary means of temporal feature creation from this process is through the analysis of
model fit. The first metric calculated is a normalized, hourly residual, R, that can be used to visualize deviations from the model. It is calculated from the actual load, L_a, and the predicted load, L_p. The residual at a particular hour, t, is calculated using Equation 6.

R_t = \frac{L_{t,a} - L_{t,p}}{\max L_a} \qquad (6)
An example of the TOWT model implemented on one of the case study buildings is seen in
Figure 6. Two primary characteristics are captured from a model residual analysis. The first is deviation from the set time-of-week schedule, behavior that causes the model to highly over-predict. These deviations are most often attributed to public holidays, breaks in normal operation, or changes in normal operating modes. In the single building study, one of the most visible daily deviations, Christmas Day, is observed. This day is significantly over-predicted because the model is not informed of the Christmas Day holiday. The automated capture of this phenomenon can indicate whether the building is of a certain use-type or in a particular jurisdiction. The second characteristic obtained is periods of underprediction, when the building is consuming more electricity than expected. These data inform whether a building is being consistently utilized, or whether there is volatility in its normal operating schedule from week-to-week.
4.2. Change point model regression
Another means of performance modeling that considers weather characterization is the use of
linear change point models. The outputs of these models are interpretable in approximating
the amount of energy being used for heating, ventilation, and air-conditioning (HVAC). This type
of model has its basis in the previously-mentioned PRISM method and has been continuously
utilized, recently by Kissock and Eger [40]. This multivariate, piece-wise regression model is
developed using daily consumption and outdoor air dry-bulb temperature information. A linear
regression model is fitted to data detected to be correlated with outdoor dry-bulb air temperature,
either positively for cooling energy consumption or negatively for heating energy consumption.
For example, as the outdoor air temperature climbs above a certain point, the relationship be-
tween electricity consumption and every degree increase in temperature should be a straight line
with a certain slope if the building has an electrically-driven cooling system. The point at which
Figure 6: Single building example of TOWT model with hourly normalized residuals
this change occurs is considered the cooling balance point of the building, and the slope of the
line is the rate of cooling energy increase due to outdoor air conditions.
Equations 7 and 8 are used to predict energy consumption based on an outdoor air temperature, T. These equations approximate the cooling (β2(T − β3)) and heating (β2(β3 − T)) components of the electrical consumption to a certain level of accuracy.

Ec = β1 + β2(T − β3)    (7)

Eh = β1 + β2(β3 − T)    (8)
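As a concrete illustration, a change point model of the cooling form in Equation 7 can be fitted with ordinary nonlinear least squares. The sketch below uses SciPy on synthetic data; the data, starting values, and the max() formulation of the balance point are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import curve_fit

def cooling_change_point(T, b1, b2, b3):
    # E = b1 + b2 * max(T - b3, 0): flat base load below the balance
    # point b3, linear cooling response of slope b2 above it
    return b1 + b2 * np.maximum(T - b3, 0.0)

# Synthetic daily data: 100 kWh base load, 18 C balance point,
# 5 kWh per degree of cooling response, plus noise
rng = np.random.default_rng(0)
T = rng.uniform(0.0, 35.0, 365)
E = cooling_change_point(T, 100.0, 5.0, 18.0) + rng.normal(0.0, 2.0, 365)

(b1, b2, b3), _ = curve_fit(cooling_change_point, T, E, p0=[90.0, 2.0, 15.0])
# b1 ~ base load, b2 ~ cooling slope, b3 ~ cooling balance point
```

With a year of daily observations, the recovered balance point b3 approximates the temperature at which cooling energy begins to appear in the meter data.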
4.3. Seasonality and trend decomposition
Temporal data from different sources often exhibit similar types of behavior that are studied within the field of forecasting and temporal data mining. Electrical building meter data fits into this category, and the same feature extraction techniques can be applied as what is commonly done for financial or social science analysis. These techniques often seek to decompose time-series data into several components that represent the underlying nature of the data [34]. For example, the electrical meter data collected from buildings is often cyclical in its weekly schedule. People are utilizing buildings each day of the week in a relatively predictable pattern. A prevalent example of this behavior is found in office buildings where occupants are typically white-collar professionals who come into work on weekdays at a particular time and leave to go home at a certain time. Weekends are unoccupied periods in which there is little to no activity.
This behavior is an example of what’s known as seasonality within time series analysis. Season-
ality is a fixed and known period of consistent modulation and is a feature that is often extracted
before creating predictive models.
Trends are another feature commonly found in temporal data. A trend is a long-term increase
or decrease in the data that often doesn’t follow a particular pattern. Trends are commonly due
to factors that are less systematic than seasonality and are often due to external influences. For
building energy consumption, trends manifest themselves as gradual shifts in consumption over
the course of weeks or months. Often these variations are due to weather-related factors influencing the HVAC equipment. Other causes of trends are changes in occupancy or degradation of system efficiency.
To capture these features to understand their impact on characterizing buildings, the seasonal-
trend decomposition procedure based on loess is used to extract each of these features from
the case study buildings [41]. This process is used to remove the weekly seasonal patterns
from each building, the long-term trend over time, and the residual remainders from the model
developed by those two components. The input data is aggregated to daily summations and
weather-normalized by subtracting the calculated heating and cooling elements from the change
point model described in Section 4.2. This step is done to reduce the influence weather plays in
the trend decomposition. The STL package in R is used for this process to extract the seasonal,
trend, and irregular components5.
The details of the internal algorithms of the STL procedure are described by Cleveland et al. [41]. The process uses an inner loop of algorithms to detrend and deseasonalize the data by creating a trend component, Tv, and a seasonal component, Sv. The remainder component, Rv, is the result of subtracting both from the input values, Yv, as seen in Equation 9.

Rv = Yv − Tv − Sv    (9)
4.4. Implementation of model-based features
Figure 7 illustrates an overview of an implementation of three examples of model-based features on all the buildings across the various building use types in the study. The heat map at the far left illustrates normalized residuals from the load shape regression model. The differences between each use type can be noticed from a high level due to the nature of the residuals. The darker areas of the visualization indicate when the model is highly over-predicting consumption and lighter areas indicate when the model is under-predicting. Typical holiday periods such as spring, summer and winter breaks and holidays such as the American Labor Day and Thanksgiving are seen as darker areas. Offices, labs and classrooms seem to have similar residual patterns, likely due to their scheduling being similar. Slight fundamental differences are seen, such as the fact that classrooms have more general areas of over-prediction, likely due to less consistent occupancy. Primary/Secondary schools and dormitories are less predictable on an annual basis due to their strong seasonal patterns of use; this fact is intuitive, and model residuals of this type are accurate in automatically characterizing this behavior. The center figure illustrates heating energy regression for all case study buildings. These figures have been normalized according to floor area. Each building's response to outdoor air temperature is indicative of the type of systems installed in addition to the efficiency of energy conversion of those systems. The far right heat map illustrates the trend decomposition as applied to the entire case study set of buildings. Offices appear to have quite a bit of diversity over time, with a few observable systematic low spots in the spring and autumn periods at the bottom of the heat map. Laboratories reflect that behavior, while university classrooms visually have an opposite effect, with a less-than-average trend in the summer months. Primary/Secondary school classrooms have a very distinct delineation between when school is in session and out of session during the summer and various breaks. As many of these schools are in the UK, their out-of-session periods appear to line up naturally. University dormitories also have clear delineations between occupied and unoccupied periods, and they also seem to match up quite well, despite the diversity of data sources of these buildings.
5https://stat.ethz.ch/R-manual/R-devel/library/stats/html/stl.html
Figure 7: Heat map of a selection of model-based temporal features: Daily normalized residuals from load shape regres-
sion models (left), heating energy prediction using change point model regression (center), and seasonal trends using the
STL package (right)
5. Pattern-based features
The third category of temporal features developed in this study is related to capturing the typical and atypical patterns of use from building performance data. The goal of these features is to quantify whether a building has consistency on a daily or weekly basis, whether certain building types exhibit particular patterns of use, and how these patterns can be used to predict various kinds of meta-data. In temporal feature mining, two concepts are relevant in this analysis: motifs and discords. A motif is a typical pattern that occurs on a regular basis within a data set [42]. A discord is an unusual pattern within a data set that identifies infrequent behavior [43]. Several of the temporal features developed in this process are designed to leverage these concepts. The pattern-based feature categories outlined in this section include diurnal pattern extraction, pattern specificity, and long-term consistency.
5.1. Diurnal pattern extraction
The first temporal feature outlined is based on the DayFilter process, which extracts the motifs and discords from raw meter data based on 24-hour periods [44]. This process heavily utilizes
the Symbolic Aggregate approXimation (SAX) representation of time-series data [45]. SAX is
a process of time-series data discretization that converts temporal data into the string data type.
This process empowers various text mining and visualization techniques. The primary feature
extracted from this process for this study is diurnal pattern frequency which quantifies the number
and size of motifs found from a particular meter.
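A minimal sketch of the SAX discretization itself (z-normalize, piecewise-aggregate, then map segment means to letters via Gaussian breakpoints) illustrates the representation that DayFilter builds on; the segment count, alphabet, and example profile below are illustrative choices, not the published parameters:

```python
import numpy as np

def sax_word(series, n_segments=4, alphabet="abcd"):
    """Convert a time series into a SAX word."""
    x = (series - series.mean()) / series.std()      # z-normalize
    paa = x.reshape(n_segments, -1).mean(axis=1)     # piecewise aggregate approximation
    breakpoints = np.array([-0.67, 0.0, 0.67])       # N(0,1) quartile breakpoints
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

# One day of hourly load: low overnight, high during working hours
day = np.array([20.0] * 6 + [60.0] * 2 + [90.0] * 8 + [60.0] * 2 + [20.0] * 6)
word = sax_word(day)  # a low-high-high-low daily shape such as "adda"
```

Days that share a SAX word form a motif candidate; rare words point at discords.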
5.2. Pattern specificity
Another method to leverage SAX to characterize the case study data is to use it to extract which patterns are most indicative of a particular building use type. This information is obtained using the SAX-VSM process pioneered by Senin and Malinchik, which combines SAX with the Vector Space Model technique from the text mining field [46]. Conventionally this method is utilized as a classification model to predict the class to which a certain time-series belongs. A by-product of the process is that the subsequences of each data stream are assigned a metric indicating their specificity. Pattern specificity is a concept that quantifies how well a meter fits within its class. This technique is used to determine whether a building is operating similarly to other supposed peer buildings of the same type.
5.3. Long-term pattern consistency
The concept of long-term consistency is related to how volatile a building's electrical consumption is over the course of a long-range period such as one year. A building that is considered more volatile will have significant shifts in steady-state operation over the course of a year. Often these changes are related to seasonality of scheduling, as can be the case in buildings like schools and universities. A less volatile building will be more consistent in overall magnitude of consumption over the course of a year. This behavior is more often the case in offices and laboratories. In this analysis, a concept known as breakout detection is utilized to quantify the difference between these behaviors. A metric is created to detect the number of shifts in relative steady-state over the course of the time range. This metric was developed in a previous study focused on data from a single campus [35]. An R programming package, BreakoutDetection, is used to create this parameter. This package was developed by the social media company Twitter to process their time-series data6. The details of the algorithms utilized in this package can be found in a study by James et al. [47].
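The breakout count can be approximated with a simple mean-shift scan; the sketch below is a simplified stand-in for the E-Divisive-with-Medians algorithm in the BreakoutDetection package, with an illustrative threshold and synthetic data:

```python
import numpy as np

def count_breakouts(x, min_size=30, threshold=3.0):
    """Count shifts in relative steady-state: flag a breakout when the
    mean of the next `min_size` window differs from the mean of the
    current segment by more than `threshold` within-segment standard
    deviations, enforcing `min_size` samples between breakouts."""
    x = np.asarray(x, dtype=float)
    count, start, i = 0, 0, min_size
    while i <= len(x) - min_size:
        left, right = x[start:i], x[i:i + min_size]
        within_sd = np.sqrt((left.var() + right.var()) / 2.0) + 1e-9
        if abs(left.mean() - right.mean()) > threshold * within_sd:
            count += 1
            start = i
            i += min_size  # enforce the minimum segment length
        else:
            i += 1
    return count

# Synthetic daily series with three level shifts (occupied/unoccupied)
rng = np.random.default_rng(42)
levels = np.concatenate([np.full(90, 100.0), np.full(60, 60.0),
                         np.full(120, 100.0), np.full(95, 60.0)])
n_shifts = count_breakouts(levels + rng.normal(0.0, 5.0, levels.size))
```

The resulting count serves as the volatility feature: a schedule-driven building such as a dormitory accumulates several shifts per year, while a consistently operated office accumulates few.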
Figure 8 illustrates the breakout detection process for a single building data stream. This dataset includes hourly data from an entire year for a university dormitory building. A minimum threshold of 30 days is chosen in this case, which explains the lack of a detected shift in April, a break that may be attributed to spring break for this building. Seven total steady-state variations were detected by the algorithm in this case, and many of them occur during the conventional university scheduled summer, spring, fall and winter breaks.
6https://github.com/twitter/BreakoutDetection
Figure 8: Single building example of breakout detection to test for long-term volatility in a university dormitory building.
5.4. Implementation of pattern-based features
Figure 9 shows three pattern-based features as applied to all the case study buildings. The far left heat map shows the pattern frequency metric extracted from the DayFilter process as applied to all the case study buildings. One will notice that there is a range of pattern frequencies occurring across each of the building use types. Offices and Primary/Secondary Classrooms seem to have larger regions of darker, more consistent behavior. Labs and classrooms seem to be more volatile across the time ranges. The center heat map illustrates breakout detection across the building use types in this study. This implementation uses the same input parameter of a 30-day minimum between breakouts. One notices some consistency among offices, labs, and classrooms regarding the distribution of breakout numbers, while university dormitories and primary/secondary classrooms have a noticeably higher number of breakouts across the range of behavior. The far right heat map illustrates the daily specificity calculation process applied to all 507 case studies as divided among the use types. Clear differences in patterns across the time ranges are visible for each of the building use types. Offices, university laboratories, and university classrooms all seem to have similar phases of specificity at similar times of the year, while their breaks often differentiate dorms and primary/secondary schools.
6. Prediction of building use, performance class, and operational strategy
Visualization of temporal features on their own is a means of understanding the range of values of the various phenomena across a time range. This situation gives an analyst the basis to begin understanding what discriminates a building based on different objectives. The next step is to utilize the features to predict whether a building falls into a particular category and test the importance of various elements in making that prediction. Understanding which features are most characteristic of a particular objective is the fundamental tenet of this study. In this section, three classification objectives are tested:
1. Principal Building Use - The primary use of the building is designated according to the principal activity conducted, by percentage of space designated for that activity. It is rare for a building
Figure 9: Heat map of a selection of pattern-based temporal features: daily pattern frequency from the DayFilter process
(left), breakout detection for long term volatility (center), and in-class specificity using the SAX-VSM process (right)
to be devoted specifically to a single task, and mixed-use buildings pose a specific challenge
to prediction.
2. Performance Class - Each building is assigned to a particular performance class according to whether its area-normalized consumption falls in the bottom, middle, or top 33% percentile within its principal building use-type class.
3. General Operation Strategy - Buildings that are controlled by the same entity, such as those on a university campus, often have similar schedules, operating parameters, and use patterns. This objective tests how distinct these differences are between different campuses.
6.1. Principal building use
The first scenario investigated is the characterization of primary building use type. The goal of this effort is to quantify what temporal behavior is most characteristic of a building being used for a certain purpose. For example, what makes the electrical consumption patterns of an office building unique as compared to other purposes such as a convenience store, airport, or laboratory? This objective is necessary to understand which buildings are the peers of a given building. Whatever category a building is assigned determines what benchmark is used to determine the performance level of the building. The EnergyStar Portfolio Manager is the most common benchmarking platform in the United States, and the first step in its evaluation is identifying the property type. There are 80 property types in Portfolio Manager and each one is devoted to a particular primary building use type. Twenty-one of those property types are available for submission to achieve a 1-100 ENERGYSTAR score in the United States.
Allocation of the primary use type of a building is often considered a trivial activity when working with a smaller set of buildings. As the number of buildings being analyzed grows, so does the complexity of space use evaluation. The use of buildings changes over time and these changes are not always documented. In several of the case studies, this topic was discussed and highlighted as an issue concerning benchmarking a building.
Discriminatory features have already been visualized extensively in previous sections and the differences between the primary use types are apparent in the overview heat maps of each feature. Figure 10 is the first such example of the output results of the classification model in predicting the building's primary use type using the temporal features created in this study. This visualization is a kind of error matrix, or confusion matrix, that illustrates the performance of a supervised classification algorithm. The y-axis represents the correct label of each classification input and the x-axis is the predicted label. An accurate classification would fall on the left-to-right diagonal of the grid. This grid is normalized according to the percentage of buildings within each class. The model was built using the scikit-learn Python library7 with the number of estimators set to 100 and the minimum samples per leaf set to 2. The overall general accuracy of the model is 67.8% as compared to a baseline model of 22.2%. The baseline model uses a stratified strategy in which categories are chosen randomly based on the percentage of each class occurring in the training set. Based on the analysis, university dormitories and primary/secondary classrooms are the best-characterized use types overall, with precisions of 92% and 96% and accuracies of 74% and 75%, respectively. The office category is easily confused with university classrooms and laboratories. This situation is not surprising as many of these facilities are quite similar and uses within these categories often overlap.
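The model setup and stratified baseline described above can be sketched as follows; the feature matrix here is synthetic stand-in data, not the paper's temporal feature set:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the temporal feature matrix:
# 507 buildings, 40 features, 5 principal use-type classes
X, y = make_classification(n_samples=507, n_features=40, n_informative=15,
                           n_classes=5, random_state=0)

# Random forest with the hyperparameters reported in the text
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=0)
# Stratified baseline: predict classes at random in proportion
# to their frequency in the training set
baseline = DummyClassifier(strategy="stratified", random_state=0)

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
base_acc = cross_val_score(baseline, X, y, cv=5).mean()
```

After fitting, the `feature_importances_` attribute of the forest supports the variable importance screening discussed throughout this section.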
Previously, an example of how to characterize building use type was illustrated using a random forest model and various feature importance techniques. In this subsection, a discussion is presented of how this sort of characterization can be useful in a practical setting. In the case study interviews, the topic of benchmarking of buildings was discussed. One of the issues presented to the operations teams was the concept of not having a complete understanding of the way the buildings on their campus were being used. For example, several of the campuses have a spreadsheet that outlines various metadata about the facilities on campus. This worksheet, in many cases, includes the primary use type of the building. It was found that this primary use type designation is often loosely based on information from when the building was constructed or on an informal site survey. In other situations, the building has an accurate breakdown of all the sub-spaces in the building and approximately how the spaces are being used. In these discussions, the idea was presented that building use type characterization could be used to determine automatically whether the labels within these spreadsheets aligned with the patterns of use characterized by the temporal feature extraction. This proposal was met with some positive feedback, albeit with hesitation to fully confirm that this process would be entirely necessary if labor could instead be directed to the same task.
Many of the case study subjects were then shown a series of graphics designed to tell the story of building use type characterization in an automated way. Figure 11 is the first graphic shown to the subjects. Each of the variables visualized in this figure has been scaled within
7http://scikit-learn.org/
Figure 10: Classification error matrix for prediction of building use type using a random forest model
their ranges, which causes the most extreme values to occur at the minimum and maximum of the y-axis. This figure illustrates several of the most easily understood temporal features and how they break down across the various building use types. This graphic was created using the data for a particular case study; therefore more separation between the classes exists than in the prediction of classes found in the previous section. Discussions using this graphic first centered around the first feature: Daily Magnitude per Area. It was intuitive to most participants that university laboratories have more and primary/secondary schools have less consumption per area than the other use types. It is more surprising, however, that certain building use types are characterized well by other features, such as the number of breakouts with primary/secondary schools and the daily and weekly specificity with university dormitories. Outlier buildings for each of the primary use types can be found for all of the variables; this occurrence is natural in the construction industry, and these have not been filtered.
Figure 11: Simplified breakdowns of general features according to building use type that were presented to case study
subjects
6.2. Characterization of building performance class
The second objective targeted in this study is the ability of temporal features to characterize whether a building performs well or not within its use-type class. Consumption is the metric being measured; therefore, the goal of this analysis is not to predict the performance of a building, but to determine which temporal characteristics are correlated with good or poor performance. This effort is related to the process of benchmarking buildings. Using the insight gained through characterization of building use type, it is possible to inform whether a building's behavior matches its peers. Once a building is part of a peer group, it is necessary to understand how well that building performs within that group. In this section, the case study buildings are divided according to the percentile each falls within for its in-class performance. The buildings are divided according to percentiles, with those in the lowest 33% classified as "Low," those in the 33rd to 66th percentile as "Intermediate," and the top 33% classified as "High." As in the previous section, these classifications and a subset of temporal features are implemented into a random forest model to understand how well the features characterize the different classes. Since this objective is related to consumption, all input features with known correlations to consumption were removed from the training set. These include the prominent features of consumption per area, but also many of the statistical metrics such as maximum and minimum values. Most of the daily ratio input features remain in the analysis as they are not directly correlated with total consumption. Figure 12 illustrates the results of the model in an error matrix. It can be seen that high and low consuming buildings are well characterized. The intermediate buildings have higher error rates and are often misclassified as one of the other two classes. The overall accuracy of the model for classification is 62.3% as compared to a baseline of 38%.
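The in-class tertile labeling can be sketched with pandas, computing percentile cuts within each use-type group rather than globally; the EUI values and column names below are illustrative, not drawn from the case study data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "use_type": rng.choice(["office", "lab", "dormitory"], size=300),
    "eui": rng.lognormal(mean=4.5, sigma=0.5, size=300),  # kWh/m2/yr, synthetic
})

# Tertiles are computed within each principal use-type class,
# not across the whole building stock
df["perf_class"] = df.groupby("use_type")["eui"].transform(
    lambda s: pd.qcut(s, 3, labels=["Low", "Intermediate", "High"]))
```

Grouping before cutting is what makes the label a within-peer-group ranking: a laboratory is only compared against other laboratories.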
In a situation similar to the discussion about building use type, participants in the case studies
were guided through the process of analysis using a subset of features from buildings on their
campus. Figure 13 illustrates a graphic that was shown to the groups. In this case, the buildings
Figure 12: Classification error matrix for prediction of performance class using a random forest model
are divided into two classes: Good and Bad. These categories are based on whether the building falls in the upper or lower 50% within its class. The first observation by the case study participants is that the load diversity, or the daily maximum versus minimum, is a reliable indicator of the performance class. This fact is not surprising as this metric indicates the magnitude of the base load consumption as compared to the peak. Other relatively strong differentiators, in this case, are cooling energy, seasonal changes, and weekly specificity. The discussions related to this graphic centered around the potential for the temporal features to inform why a building is performing well or not.
Figure 14 illustrates another graphic related to building consumption classes that was discussed with case study participants. This graphic is an overview of the distributions of the simplified set of features for a certain campus as compared to the entire set of case study buildings. This graphic shows where the buildings on this campus stand as compared to their peers. In this case, the buildings are on the higher end of the normalized consumption, likely because almost all of them are also in the top 20% of buildings for heating energy consumption.
Figure 13: Simplified breakdowns of general features according to performance level that were presented to case study
subjects
The buildings also have a relatively high load diversity, thus the base loads for this campus are
likely higher than average and interventions could be designed to reduce this unoccupied load.
Many of the case study participants saw this insight as useful, as it supplements the information from benchmarking.
Figure 14: Feature distributions of a single campus as compared to all other case study buildings
6.3. Characterization of operational strategies
The final characterization objective for the case studies is the ability of the temporal features to classify buildings from the same campus, and thus buildings that are being operated in similar ways. This characterization takes into account the similarity in occupancy schedules, patterns of use, and other factors related to how a building performs. Like the performance classes, this type of classification is more important for understanding the features that contribute to the differentiation than for the classification itself. Seven campuses were selected from the 507 buildings to create seven groups of buildings to characterize the differences between their operating behaviors. Features that are indicators of weather sensitivity were removed for this objective, as these relate to the location of the buildings, and thus, the campus on which they are located. Figure 15 illustrates the results from the random forest model trained on these data. The accuracy of this model is 80.5% as compared to a baseline of 16.9%. The model is excellent at predicting some of the groups, such as groups 1-4, while weaker on others, such as groups 5-7. The high accuracy of this prediction is surprising and attests to the ability of the temporal features and the random forest model to capture the operational norms of these buildings.
Figure 15: Classification error matrix for prediction of operations group type using a random forest model
7. Conclusion
This paper was undertaken with objectives related to the characterization of building behavior using temporal feature extraction and variable importance screening. The primary goal of the effort is to automate the process of predicting various types of meta-data.
A framework of analysis was developed to address and test this effort. This process was implemented on two sets of case study buildings and the key quantitative conclusions include:
• The framework can characterize primary building use type with a general accuracy of 67.8% as compared to a baseline model of 22.2% based on five use type classes. Temporal features enable a three-fold increase in building use prediction. Pattern-based features are the most common category in the top ten in the characterization of use-type, and thus are important differentiators as compared to more traditional features. Features from the STL decomposition process were found to be important as well due to their ability to distinguish differences in normalized weekly patterns.
• For building performance class, the overall classification accuracy of the model is 62.3% as compared to a baseline of 38%. The top indicator of high versus low in-class building performance was the pattern specificity temporal feature. Once again, pattern-based temporal features were found to be significant in distinguishing between different types of behavior.
• For operations class, the accuracy of the model is 80.5% as compared to a baseline of 16.9%, a four-fold increase. Daily scheduling of buildings was captured using the DayFilter features, accounting for half of the input feature set.
7.1. Open data and reproducible research
The source code and analytics workflow of this paper can be found in a series of Jupyter
notebooks found in a GitHub repository8. The data set that is utilized for the analysis is the
Building Data Genome Project9. These analysis files can be downloaded, and much of the work
replicated.
8. Acknowledgements
The authors would like to thank all of the building operations and maintenance professionals from around the world that assisted in the gathering of the data utilized. This study was funded by a Fellowship from the Institute of Technology in Architecture (ITA) at the ETH Zürich.
9. References
[1] E. Mills, Building commissioning: a golden opportunity for reducing energy costs and greenhouse gas emissions
in the United States, Energy Efficiency 4 (2011) 145–173.
[2] M. Enger, H. Friedman, D. Moser, Building Performance Tracking in Large Commercial Buildings: Tools and
Strategies - Subtask 4.2 Research Report: Investigate Energy Performance Tracking Strategies in the Market, Tech-
nical Report, 2010.
8https://github.com/buds-lab/temporal-features-for-nonres-buildings-library
9http://www.buildingdatagenome.org/
[3] J. Ulickey, T. Fackler, E. Koeppel, J. Soper, Building Performance Tracking in Large Commercial Buildings:
Tools and Strategies - Subtask 4.3 Characterization of Fault Detection and Diagnostic (FDD) and Advanced Energy
Information System (EIS) Tools, Technical Report, 2010.
[4] E. Greensfelder, H. Friedman, E. Crowe, Building Performance Tracking in Large Commercial Buildings: Tools
and Strategies - Subtask 4.4 Research Report: Characterization of Building Performance Metrics Tracking Method-
ologies, Technical Report, 2010.
[5] The White House, FACT SHEET: Cities, Utilities, and Businesses Commit to Unlocking Access to Energy Data
for Building Owners and Improving Energy Efficiency (2016).
[6] C. Miller, Z. Nagy, A. Schlueter, A review of unsupervised statistical learning and visual analytics techniques
applied to performance analysis of non-residential buildings, Renewable and Sustainable Energy Reviews (2017).
[7] A. Albert, M. Maasoumy, Predictive segmentation of energy consumers, Applied Energy 177 (2016) 435–448.
[8] A. Albert, R. Rajagopal, R. Sevlian, Segmenting Consumers Using Smart Meter Data, in: Proceedings of the
Third ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, BuildSys '11, ACM,
New York, NY, USA, 2011, pp. 49–50.
[9] S. Borgeson, Targeted Efficiency: Using Customer Meter Data to Improve Efficiency Program Outcomes, PhD,
University of California, Berkeley, Berkeley, CA, USA, 2013.
[10] J. Kwac, R. Rajagopal, Demand response targeting using big data analytics, in: Big Data, 2013 IEEE International
Conference on, IEEE, 2013, pp. 683–690.
[11] T. Rasanen, M. Kolehmainen, Feature-based clustering for electricity use time series data, in: Adaptive and
Natural Computing Algorithms: 9th International Conference, ICANNGA 2009, Springer, Kuopio, Finland, 2009,
pp. 401–412.
[12] F. Iglesias, W. Kastner, Analysis of Similarity Measures in Time Series Clustering for the Discovery of Building
Energy Patterns, Energies 6 (2013) 579–597.
[13] S. Petcharat, S. Chungpaibulpatana, P. Rakkwamsuk, Assessment of potential energy saving using cluster analysis:
A case study of lighting systems in buildings, Energy and Buildings 52 (2012) 145–152.
[14] A. Lavin, D. Klabjan, Clustering time-series energy data from smart meters, Energy Efficiency 8 (2014) 681–689.
[15] G. Chicco, I. S. Ilie, Support vector clustering of electrical load pattern data, IEEE Transactions on Power Systems
24 (2009) 1619–1628.
[16] S. M. Bidoki, N. Mahmoudi-Kohan, M. H. Sadreddini, M. Zolghadri Jahromi, M. P. Moghaddam, Evaluating
different clustering techniques for electricity customer classification, in: Transmission and Distribution Conference
and Exposition, 2010 IEEE PES, IEEE, New Orleans, LA, USA, 2010, pp. 1–5.
[17] I. Panapakidis, M. Alexiadis, G. Papagiannis, Evaluation of the performance of clustering algorithms for a high
voltage industrial consumer, Engineering Applications of Artificial Intelligence 38 (2015) 1–13.
[18] S. P. Pieri, I. Tzouvadakis, M. Santamouris, Identifying energy consumption patterns in the Attica hotel sector
using cluster analysis techniques with the aim of reducing hotels CO2 footprint, Energy and Buildings 94 (2015)
252–262.
[19] H. X. Zhao, F. Magoules, A review on the prediction of building energy consumption, Renewable and Sustainable
Energy Reviews 16 (2012) 3586–3592.
[20] S. V. Verdu, M. O. Garcia, C. Senabre, A. G. Marin, F. J. G. Franco, Classification, Filtering, and Identification
of Electrical Customer Load Patterns Through the Use of Self-Organizing Maps, IEEE Transactions on Power
Systems 21 (2006) 1672–1682.
[21] A. R. Florita, L. J. Brackney, T. P. Otanicar, J. Robertson, Classification of Commercial Building Electrical Demand
Profiles for Energy Storage Applications, in: Proceedings of ASME 2012 6th International Conference on Energy
Sustainability & 10th Fuel Cell Science, Engineering and Technology Conference (ESFuelCell2012), San Diego,
CA, USA.
[22] T. G. Nikolaou, D. S. Kolokotsa, G. S. Stavrakakis, I. D. Skias, On the Application of Clustering Techniques
for Office Buildings' Energy and Thermal Comfort Classification, IEEE Transactions on Smart Grid 3 (2012)
2196–2210.
[23] J. Ploennigs, B. Chen, P. Palmes, R. Lloyd, e2-Diagnoser: A System for Monitoring, Forecasting and Diagnosing
Energy Usage, in: Data Mining Workshop (ICDMW), 2014 IEEE International Conference on, IEEE, Shenzhen,
China, 2014, pp. 1231–1234.
[24] A. Shahzadeh, A. Khosravi, S. Nahavandi, Improving load forecast accuracy by clustering consumers using smart
meter data, in: Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney,
Ireland, pp. 1–7.
[25] A. Reinhardt, S. Koessler, PowerSAX: Fast motif matching in distributed power meter data using symbolic repre-
sentations, in: Proceedings of 9th IEEE International Workshop on Practical Issues in Building Sensor Network
Applications (SenseApp 2014), IEEE, Edmonton, Canada, 2014, pp. 531–538.
[26] Z. Liu, H. Li, K. Liu, H. Yu, K. Cheng, Design of high-performance water-in-glass evacuated tube solar water
heaters by a high-throughput screening based on machine learning: A combined modeling and experimental study,
Solar Energy 142 (2017) 61–67.
[27] Y. Chen, T. Hong, M. A. Piette, City-Scale Building Retrofit Analysis: A Case Study using CityBES, in: Building
Simulation 2017, San Francisco, CA, USA.
[28] C. Miller, Screening Meter Data: Characterization of Temporal Energy Data from Large Groups of Non-Residential
Buildings, Ph.D. thesis, ETH Zürich, Zürich, Switzerland, 2017.
[29] C. Miller, F. Meggers, The Building Data Genome Project: An open, public data set from non-residential building
electrical meters, Energy Procedia 122 (2017) 439–444.
[30] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[31] T. Hastie, R. Tibshirani, J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction,
Springer series in statistics, Springer, New York, NY, 2nd edition, 2009.
[32] G. Louppe, L. Wehenkel, A. Sutera, P. Geurts, Understanding variable importances in forests of randomized trees,
in: Advances in Neural Information Processing Systems, pp. 431–439.
[33] S. Borgeson, J. Kwac, visdom: R package for energy data analytics, 2015. R package version 0.9.
[34] T. Mitsa, Temporal Data Mining, Chapman and Hall/CRC, 2010.
[35] C. Miller, A. Schlueter, Forensically discovering simulation feedback knowledge from a campus energy informa-
tion system, in: Proceedings of the 2015 Symposium on Simulation for Architecture and Urban Design (SimAUD
2015), SCS, Washington DC, USA, 2015, pp. 33–40.
[36] M. F. Fels, PRISM: an introduction, Energy and Buildings 9 (1986) 5–18.
[37] J. W. Taylor, L. M. De Menezes, P. E. McSharry, A comparison of univariate methods for forecasting electricity
demand up to a day ahead, International Journal of Forecasting 22 (2006) 1–16.
[38] P. Price, Methods for Analyzing Electric Load Shape and its Variability, Lawrence Berkeley National Laboratory
(2010).
[39] J. L. Mathieu, P. N. Price, S. Kiliccote, M. A. Piette, Quantifying changes in building electricity use, with
application to demand response, IEEE Transactions on Smart Grid 2 (2011) 507–518.
[40] J. K. Kissock, C. Eger, Measuring industrial energy savings, Applied Energy 85 (2008) 347–361.
[41] R. B. Cleveland, W. S. Cleveland, J. E. McRae, I. Terpenning, STL: A seasonal-trend decomposition procedure
based on loess, Journal of Official Statistics 6 (1990) 3–73.
[42] P. Patel, E. Keogh, J. Lin, S. Lonardi, Mining motifs in massive time series databases, in: Proceedings of 2002
IEEE International Conference on Data Mining (ICDM 2002), Maebashi City, Japan, pp. 370–377.
[43] E. J. Keogh, J. Lin, A. Fu, HOT SAX: Efficiently finding the most unusual time series subsequence, in: Proceedings
of the Fifth IEEE International Conference on Data Mining (ICDM '05), IEEE, Houston, TX, USA, 2005.
[44] C. Miller, Z. Nagy, A. Schlueter, Automated daily pattern filtering of measured building performance data, Au-
tomation in Construction 49, Part A (2015) 1–17.
[45] J. Lin, E. J. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with implications for stream-
ing algorithms, in: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery (DMKD ’03), ACM, San Diego, CA, USA, 2003, pp. 2–11.
[46] P. Senin, S. Malinchik, SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model,
in: 2013 IEEE 13th International Conference on Data Mining, Institute of Electrical & Electronics Engineers
(IEEE), 2013.
[47] N. A. James, A. Kejariwal, D. S. Matteson, Leveraging Cloud Data to Mitigate User Experience from Breaking
Bad, arXiv preprint arXiv:1411.7955 (2014).