Evaluating and Aggregating Feature-based Model Explanations
Umang Bhatt1,2, Adrian Weller1,3 and José M. F. Moura2
1University of Cambridge
2Carnegie Mellon University
3The Alan Turing Institute
{usb20, aw665}@cam.ac.uk, moura@ece.cmu.edu
Abstract
A feature-based model explanation denotes how much each input feature contributes to a model's output for a given data point. As the number of proposed explanation functions grows, we lack quantitative evaluation criteria to help practitioners know when to use which explanation function. This paper proposes quantitative evaluation criteria for feature-based explanations: low sensitivity, high faithfulness, and low complexity. We devise a framework for aggregating explanation functions. We develop a procedure for learning an aggregate explanation function with lower complexity and then derive a new aggregate Shapley value explanation function that minimizes sensitivity.
1 Introduction
There has been great interest in understanding black-box machine learning models via post-hoc explanations. Much of this work has focused on feature-level importance scores for how much a given input feature contributes to a model's output. These techniques are popular amongst machine learning scientists who want to sanity check a model before deploying it in the real world [Bhatt et al., 2020]. Many feature-based explanation functions are gradient-based techniques that analyze the gradient flow through a model to determine salient input features [Shrikumar et al., 2017; Sundararajan et al., 2017]. Other explanation functions perturb input values to a reference value and measure the change in the model's output [Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017].
With many candidate explanation functions, machine learning practitioners find it difficult to pick which explanation function best captures how a model reaches a specific output for a given input. Though there has been work in qualitatively evaluating feature-based explanation functions on human subjects [Lage et al., 2019], there has been little exploration into formalizing quantitative techniques for evaluating model explanations. Recent work has created auxiliary tasks to test if attribution is assigned to relevant inputs [Yang and Kim, 2019] and has developed tools to verify if the features important to an explanation function are relevant to the model itself [Camburu et al., 2019].
Borrowing from the humanities, we motivate three criteria for assessing a feature-based explanation: sensitivity, faithfulness, and complexity. Philosophy of science research has advocated for explanations that vary proportionally with changes in the system being explained [Lipton, 2003]; as such, explanation functions should be insensitive to perturbations in the model inputs, especially if the model output does not change. Capturing relevancy faithfully is helpful in an explanation [Ruben, 2015]. Since humans cannot process a lot of information at once, some have argued for minimal model explanations that contain only relevant and representative features [Batterman and Rice, 2014]; therefore, an explanation should not be complex (i.e., it should use few features).
In this paper, we first define these three distinct criteria: low sensitivity, high faithfulness, and low complexity. With many explanation function choices, we then propose methods for learning an aggregate explanation function that combines explanation functions. If we want to find the simplest explanation from a set of explanations, then we can aggregate explanations to minimize the complexity of the resulting explanation. If we want to learn a smoother explanation function that varies slowly as inputs are perturbed, we can leverage an aggregation scheme that learns a less sensitive explanation function. To the best of our knowledge, we are the first to rigorously explore aggregation of various explanations, while placing explanation evaluation on an objective footing. To that end, we highlight the contributions of this paper:

• We describe three desirable criteria for feature-based explanation functions: low sensitivity, high faithfulness, and low complexity.
• We develop an aggregation framework for combining explanation functions.
• We create two techniques that reduce explanation complexity by aggregating explanation functions.
• We derive an approximation for Shapley-value explanations by aggregating explanations from a point's nearest neighbors, minimizing explanation sensitivity and resembling how humans reason in medical settings.
2 Preliminaries
Restricting to supervised classification settings, let $f$ be a black box predictor that maps an input $x \in \mathbb{R}^d$ to an output $f(x) \in \mathcal{Y}$. An explanation function $g$ from a family of explanation functions, $\mathcal{G}$, takes in a predictor $f$ and a point of interest $x$ and returns importance scores $g(f, x) = \phi_x \in \mathbb{R}^d$ for all features, where $g(f, x)_i = \phi_{x,i}$ (simplified to $\phi_i$ in context) is the importance of (or attribution for) feature $x_i$ of $x$. By $g_j$, we refer to a particular explanation function, usually from a set of explanation functions $\mathcal{G}_m = \{g_1, g_2, \dots, g_m\}$. We denote $D : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}_{\geq 0}$ to be a distance metric over explanations, while $\rho : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}_{\geq 0}$ denotes a distance metric over the inputs. An evaluation criterion $\mu$ takes in a predictor $f$, explanation function $g$, and input $x$, and outputs a scalar: $\mu(f, g; x)$. $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ refers to a dataset of input-output pairs, and $\mathcal{D}_x$ denotes all $x_i$ in $\mathcal{D}$.
3 Evaluating Explanations
With the number of techniques to develop feature-level explanations growing in the explainability literature, picking which explanation function $g$ to use can be difficult. In order to study the aggregation of explanation functions, we define three desiderata of an explanation function $g$.
3.1 Desideratum: Low Sensitivity
We want to ensure that, if inputs are near each other and their model outputs are similar, then their explanations should be close to each other. Assuming $f$ is differentiable, we desire an explanation function $g$ to have low sensitivity in the region around a point of interest $x$, implying local smoothness of $g$. While [Melis and Jaakkola, 2018] codified the property, [Ghorbani et al., 2019] empirically tested explanation function sensitivity. We follow the convention of the former and define max sensitivity and average sensitivity in the neighborhood of a point of interest $x$.

Let $N_r = \{z \in \mathcal{D}_x \mid \rho(x, z) \leq r,\ f(x) = f(z)\}$ be a neighborhood of datapoints within a radius $r$ of $x$.

Definition 1 (Max Sensitivity). Given a predictor $f$, an explanation function $g$, distance metrics $D$ and $\rho$, a radius $r$, and a point $x$, we define the max sensitivity of $g$ at $x$ as:
$$\mu_M(f, g, r; x) = \max_{z \in N_r} D\big(g(f, x), g(f, z)\big)$$

Definition 2 (Average Sensitivity). Given a predictor $f$, an explanation function $g$, distance metrics $D$ and $\rho$, a radius $r$, and a distribution $\mathbb{P}_x(\cdot)$ over the inputs centered at point $x$, we define the average sensitivity of $g$ at $x$ as:
$$\mu_A(f, g, r; x) = \int_{z \in N_r} D\big(g(f, x), g(f, z)\big)\, \mathbb{P}_x(z)\, dz$$
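In practice, both criteria can be estimated by sampling a neighborhood of $x$ and comparing explanations. The following is a minimal Python sketch of such an estimate; the function and argument names are illustrative, the default metrics merely stand in for $\rho$ and $D$, and the average is taken under a uniform $\mathbb{P}_x$ over the sampled neighborhood.

import numpy as np

def estimate_sensitivity(g, f, x, candidates, radius, rho=None, D=None):
    # Distance over inputs (rho) and over explanations (D); both defaults are
    # placeholders a user would swap for the metrics of Section 2.
    rho = rho or (lambda a, b: np.max(np.abs(a - b)))
    D = D or (lambda a, b: np.linalg.norm(a - b))
    # Neighborhood N_r: candidate points within radius r of x with the same output.
    neighbors = [z for z in candidates if rho(x, z) <= radius and f(x) == f(z)]
    if not neighbors:
        return 0.0, 0.0
    phi_x = g(f, x)
    dists = [D(phi_x, g(f, z)) for z in neighbors]
    # Max sensitivity (Definition 1) and average sensitivity (Definition 2),
    # the latter under a uniform P_x over the sampled neighborhood.
    return max(dists), float(np.mean(dists))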
3.2 Desideratum: High Faithfulness
Faithfulness has been defined in [Yeh et al., 2019]. The feature importance scores from $g$ should correspond to the important features of $x$ for $f$; as such, when we set particular features $x_s$ to a baseline value $\bar{x}_s$, the change in the predictor's output should be proportional to the sum of attribution scores of the features in $x_s$. We measure this as the correlation between the sum of the attributions of $x_s$ and the difference in output when setting those features to a reference baseline. For a subset of indices $S \subseteq \{1, 2, \dots, d\}$, $x_s = \{x_i,\ i \in S\}$ denotes a sub-vector of input features that partitions the input, $x = x_s \cup x_c$. $x_{[x_s = \bar{x}_s]}$ denotes an input where $x_s$ is set to a reference baseline while $x_c$ remains unchanged: $x_{[x_s = \bar{x}_s]} = \bar{x}_s \cup x_c$. When $|S| = d$, $x_{[x_s = \bar{x}_s]} = \bar{x}$.

Remark (Reference Baselines). Recent work has discussed how to pick a proper reference baseline $\bar{x}$. [Sundararajan et al., 2017] suggests using a baseline where $f(\bar{x}) \approx 0$, while others have proposed taking the baseline to be the mean of the training data. [Chang et al., 2019] notes that the baseline can be learned using generative modeling.

Definition 3 (Faithfulness). Given a predictor $f$, an explanation function $g$, a point $x$, and a subset size $|S|$, we define the faithfulness of $g$ to $f$ at $x$ as:
$$\mu_F(f, g; x) = \operatorname{corr}_{S \in \binom{[d]}{|S|}}\left(\sum_{i \in S} g(f, x)_i,\ f(x) - f\big(x_{[x_s = \bar{x}_s]}\big)\right)$$

For our experiments, we fix $|S|$ and then randomly sample subsets $x_s$ of the fixed size from $x$ to estimate the correlation. Since we do not see all $\binom{[d]}{|S|}$ subsets in our calculation of faithfulness, we may not get an accurate estimate of the criterion. Though hard to codify and even harder to aggregate, faithfulness is desirable, as it demonstrates that an explanation captures which features the predictor uses to generate an output for a given input. Learning global feature importances that highlight, in expectation, which features a predictor relies on is a challenging problem left to future work.
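The sampling estimate described above can be sketched in a few lines of Python. The helper names are illustrative, $f$ is assumed to return a scalar (e.g., a logit), and the Pearson correlation mirrors the choice made in Section 7.1; this is only an illustrative sketch, not the exact experimental code.

import numpy as np

def estimate_faithfulness(f, g, x, baseline, subset_size, n_samples=50, seed=0):
    # Monte Carlo estimate of mu_F: correlate the summed attributions of a
    # random subset S with the drop in output when S is set to the baseline.
    rng = np.random.default_rng(seed)
    phi = g(f, x)
    d = x.shape[0]
    attr_sums, output_drops = [], []
    for _ in range(n_samples):
        S = rng.choice(d, size=subset_size, replace=False)
        x_masked = x.copy()
        x_masked[S] = baseline[S]          # x[x_s = xbar_s]
        attr_sums.append(phi[S].sum())
        output_drops.append(f(x) - f(x_masked))
    return float(np.corrcoef(attr_sums, output_drops)[0, 1])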
3.3 Desideratum: Low Complexity
A complex explanation is one that uses all $d$ features in its explanation of which features of $x$ are important to $f$. Though this explanation may be faithful to the model (as defined above), it may be too difficult for the user to understand (especially if $d$ is large). We define a fractional contribution distribution, where $|\cdot|$ denotes absolute value:
$$\mathbb{P}_g(i) = \frac{|g(f, x)_i|}{\sum_{j \in [d]} |g(f, x)_j|}; \qquad \mathbb{P}_g = \{\mathbb{P}_g(1), \dots, \mathbb{P}_g(d)\}$$
Note that $\mathbb{P}_g$ is a valid probability distribution. Let $\mathbb{P}_g(i)$ denote the fractional contribution of feature $x_i$ to the total magnitude of the attribution. If every feature had equal attribution, the explanation would be complex (even if it is faithful). The simplest explanation would be concentrated on one feature. We define complexity as the entropy of $\mathbb{P}_g$.

Definition 4 (Complexity). Given a predictor $f$, an explanation function $g$, and a point $x$, the complexity of $g$ at $x$ is:
$$\mu_C(f, g; x) = \mathbb{E}_i\big[-\ln(\mathbb{P}_g(i))\big] = -\sum_{i=1}^{d} \mathbb{P}_g(i) \ln(\mathbb{P}_g(i))$$
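A minimal Python sketch of Definition 4 follows, using the usual convention $0 \ln 0 = 0$; the names are illustrative.

import numpy as np

def complexity(f, g, x, eps=1e-12):
    # mu_C: entropy of the fractional-contribution distribution P_g.
    phi = np.abs(g(f, x))
    p = phi / (phi.sum() + eps)   # fractional contribution of each feature
    p = p[p > 0]                  # convention: 0 * ln(0) = 0
    return float(-(p * np.log(p)).sum())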
4 Aggregating Explanations
Given a trained predictor $f$, a set of explanation functions $\mathcal{G}_m = \{g_1, \dots, g_m\}$, a criterion $\mu$ to optimize, and a set of inputs $\mathcal{D}_x$, we want to find an aggregate explanation function $g_{agg}$ that satisfies $\mu$ at least as well as any $g_i \in \mathcal{G}_m$. Let $h(\cdot)$ represent some function that combines $m$ explanations into a consensus $g_{agg} = h(\mathcal{G}_m)$. We now explore different candidates for $h(\cdot)$.
4.1 Convex Combination
Suppose we have two different explanation functions $g_1$ and $g_2$ and have chosen a criterion $\mu$ to evaluate a $g$. Consider an aggregate explanation, $g_{agg} = h(g_1, g_2)$. A potential $h(\cdot)$ is a convex combination where $g_{agg} = h(g_1, g_2) = w\, g_1 + (1 - w)\, g_2 = \mathbf{w}^\top \mathcal{G}_m$.

Proposition 1. If $D$ is the $\ell_2$ distance and $\mu = \mu_A$ (average sensitivity), the following holds:
$$\mu_A(g_{agg}) \leq w\, \mu_A(g_1) + (1 - w)\, \mu_A(g_2)$$
Proof. Assuming $\mathbb{P}_x(z)$ is uniform, we can apply the triangle inequality and the convexity of $D$ to arrive at the above.
A convex combination of explanation functions thus yields an aggregate explanation function that is at most as sensitive as any of the explanation functions taken alone. In order to learn $w$ given $g_1$ and $g_2$, we set up an objective as follows:
$$w^* = \arg\min_{w}\ \mathbb{E}_{x \sim \mathcal{D}_x}\big[\mu_A(g_{agg}(f, x))\big] \qquad (1)$$
Assuming a uniform distribution around all $x \in \mathcal{D}_x$, we can rewrite this as:
$$w^* = \arg\min_{w} \int_{x \sim \mathcal{D}_x} \int_{z \in N_r} D\big(g_{agg}(x), g_{agg}(z)\big)\, \mathbb{P}_x(z)\, dz\, dx$$
By Cauchy-Schwarz, we get the following:
$$w^* \leq \arg\min_{w} \int_{x \sim \mathcal{D}_x} \int_{z \in N_r} D(a, b)\, dz\, dx$$
where $a = w\, g_1(f, x) + (1 - w)\, g_2(f, x)$ and $b = w\, g_1(f, z) + (1 - w)\, g_2(f, z)$. This implies that $w^*$ will be minimal when one element of $\mathbf{w}$ is 0 and the other is 1. Therefore, a convex combination of two explanation functions, found by solving Equation (1), will be at most as sensitive as the least sensitive explanation function.
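As an illustration of Equation (1), the sketch below grid-searches the weight $w$ and keeps the one whose aggregate has the lowest empirical average sensitivity over a dataset; consistent with the argument above, the selected weight typically lands at 0 or 1. The helper names, the grid search (in place of an exact solver), and the default $\ell_\infty$/$\ell_2$ metrics are all assumptions of this sketch.

import numpy as np

def convex_aggregate(g1, g2, w):
    # g_agg = w * g1 + (1 - w) * g2, as in Section 4.1.
    return lambda f, x: w * g1(f, x) + (1 - w) * g2(f, x)

def avg_sensitivity(g, f, x, candidates, radius):
    # Empirical average sensitivity under a uniform P_x over the neighborhood.
    nbrs = [z for z in candidates
            if np.max(np.abs(x - z)) <= radius and f(x) == f(z)]
    if not nbrs:
        return 0.0
    phi_x = g(f, x)
    return float(np.mean([np.linalg.norm(phi_x - g(f, z)) for z in nbrs]))

def fit_weight(g1, g2, f, data, radius, grid=np.linspace(0.0, 1.0, 11)):
    # Grid-search stand-in for Equation (1): pick the w whose aggregate has
    # the smallest average sensitivity over the dataset.
    def objective(w):
        g_agg = convex_aggregate(g1, g2, w)
        return np.mean([avg_sensitivity(g_agg, f, x, data, radius) for x in data])
    return min(grid, key=objective)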
4.2 Centroid Aggregation
Another sensible candidate for $h(\cdot)$ to combine $m$ explanation functions is based on centroids with respect to some distance function $D : \mathcal{G} \times \mathcal{G} \mapsto \mathbb{R}$, so that:
$$g_{agg} \in \arg\min_{g \in \mathcal{G}}\ \mathbb{E}_{g_i \in \mathcal{G}_m}\big[D(g, g_i)^p\big] = \arg\min_{g \in \mathcal{G}} \sum_{i=1}^{m} D(g, g_i)^p$$
where $p$ is a positive constant. The simplest examples of distances are the $\ell_2$ and $\ell_1$ distances with real-valued attributions where $\mathcal{G} \subseteq \mathbb{R}^d$.

Proposition 2. When $D$ is the $\ell_2$ distance and $p = 2$, the aggregate explanation is the feature-wise sample mean:
$$g_{agg}(f, x) = g_{avg}(f, x) = \frac{1}{m} \sum_{i=1}^{m} g_i(f, x) \qquad (2)$$

Proposition 3. When $D$ is the $\ell_1$ distance and $p = 1$, the aggregate explanation is the feature-wise sample median:
$$g_{agg}(f, x) = \operatorname{med}\{\mathcal{G}_m\}$$

Propositions 2 and 3 follow from standard results in statistics: the mean minimizes the sum of squared differences and the median minimizes the sum of absolute deviations [Berger, 2013].
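The centroid aggregates of Propositions 2 and 3 therefore reduce to feature-wise statistics over the stacked attribution vectors, as in the following sketch (names illustrative):

import numpy as np

def centroid_aggregate(explanations, distance="l2"):
    # Centroid aggregation of m attribution vectors for a fixed (f, x):
    # feature-wise mean for squared l2 distance (Proposition 2),
    # feature-wise median for l1 distance (Proposition 3).
    E = np.asarray(explanations)          # shape (m, d)
    if distance == "l2":
        return E.mean(axis=0)             # minimizes sum of squared distances
    if distance == "l1":
        return np.median(E, axis=0)       # minimizes sum of absolute deviations
    raise ValueError("unsupported distance")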
We could obtain rank-valued attributions by taking any quantitative vector-valued attributions and ranking features according to their values. If $D$ is the Kendall-tau distance with rank-valued attributions where $\mathcal{G} \subseteq S_d$ (the set of permutations over $d$ features), then the resulting aggregation mechanism via computing the centroid is called the Kemeny-Young rule. For rank-valued attributions, any aggregation mechanism falls under the rank aggregation problem in social choice theory, for which many practical "voting rules" exist [Bhatt et al., 2019a].
We analyze the error of a candidate $g_{agg}$. Suppose the optimal explanation for $x$ using $f$ is $g^*(f, x)$ and suppose $g_{agg}$ is the mean explanation for $x$ in Equation (2). Let $\epsilon_{i,x} = \|g^*(f, x) - g_i(f, x)\|$ be the error between the optimal explanation and the $i$th explanation function.

Proposition 4. The error between the aggregate explanation $g_{agg}(f, x)$ and the optimal explanation $g^*(f, x)$ satisfies:
$$\epsilon_{agg} \leq \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \epsilon_{j, x_i}}{mn}$$

Proof. For a fixed $x$, we have:
$$\epsilon_{agg, x} = \big\|g^*(f, x) - g_{agg}(f, x)\big\| = \Big\|\frac{m\, g^*(f, x)}{m} - \frac{1}{m} \sum_{i=1}^{m} g_i(f, x)\Big\| \leq \frac{1}{m} \sum_{i=1}^{m} \big\|g^*(f, x) - g_i(f, x)\big\| = \frac{\sum_{i=1}^{m} \epsilon_{i,x}}{m}$$
Averaging across $\mathcal{D}_x$, we obtain the result.
Hence, by aggregating, we do at least as well on average as using one explanation function alone. Many gradient-based explanation functions fit to noise [Hooker et al., 2019]. One way to reduce noise would be to aggregate by ensembling or averaging. As proven in Proposition 4, the typical error of the aggregate is at most the expected error of each function alone.
5 Lowering Complexity Via Aggregation
In this section, we describe iterative algorithms for aggregating explanation functions to obtain $g_{agg}(f, x)$ with lower complexity whilst combining $m$ candidate explanation functions $\mathcal{G}_m = \{g_1, \dots, g_m\}$. We desire a $g_{agg}(f, x)$ that contains information from all candidate explanations $g_i(f, x)$ yet has entropy less than or equal to that of each explanation $g_i(f, x)$. As discussed, a reasonable candidate for an aggregate explanation function is the sample mean given by Equation (2). We may want $g_{agg}(f, x)$ to approach the sample mean, $g_{avg}(f, x)$; however, the sample mean may have greater complexity than that of each $g_i(f, x)$.

For example, let $g_1(f, x) = [1, 0]^\top$ and $g_2(f, x) = [0, 1]^\top$. The sample mean is $g_{avg}(f, x) = [0.5, 0.5]^\top$. Both $g_1$ and $g_2$ have the minimum possible complexity of 0, while $g_{avg}$ has the maximum possible complexity, $\log(2)$. Our aggregation technique must ensure that $g_{agg}(f, x)$ approaches $g_{avg}(f, x)$ while guaranteeing $g_{agg}(f, x)$ has complexity less than or equal to that of each $g_i(f, x)$. We now present two approaches for learning a lower complexity explanation, visually represented in Figure 1.
5.1 Gradient-Descent Style Method
Our first approach is similar to gradient descent. Starting from each $g_i(f, x)$, we iteratively move towards $g_{avg}(f, x)$ in each of the $d$ directions (i.e., changing the $k$th feature by a small amount) if the complexity decreases with that move. We stop moving when the complexity no longer decreases or $g_{avg}(f, x)$ is reached. Simultaneously, we start from $g_{avg}(f, x)$ and iteratively move towards each $g_i(f, x)$ in each of the $d$ directions if the complexity decreases. We stop moving when the complexity no longer decreases or any of the $g_i(f, x)$ are reached. The final $g_{agg}(f, x)$ is the location that has the smallest complexity from these $2m$ different walks. Since we only move if the complexity decreases and start from each $g_i(f, x)$, the entropy of $g_{agg}(f, x)$ is guaranteed to be less than or equal to the entropy of all $g_i(f, x)$.
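A minimal Python sketch of this procedure is below. It uses a fixed fractional step towards the target in place of the analytic partial derivative of $\mu_C$ given in Appendix C, and all names, step sizes, and iteration caps are illustrative assumptions rather than the exact procedure of Algorithm 1.

import numpy as np

def entropy(phi, eps=1e-12):
    # Complexity mu_C of an attribution vector (Definition 4).
    p = np.abs(phi) / (np.abs(phi).sum() + eps)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def greedy_walk(start, target, step=0.05, max_sweeps=200):
    # Move one coordinate at a time towards `target`, accepting a step only
    # if the entropy decreases; stop once no coordinate improves.
    cur = np.array(start, dtype=float)
    for _ in range(max_sweeps):
        improved = False
        for k in range(cur.shape[0]):
            trial = cur.copy()
            trial[k] += step * (target[k] - cur[k])
            if entropy(trial) < entropy(cur):
                cur, improved = trial, True
        if not improved or np.allclose(cur, target):
            break
    return cur

def gradient_descent_style_aggregate(explanations):
    # Walk from each g_i towards the mean and from the mean towards each g_i,
    # then keep the lowest-entropy endpoint among the 2m walks (Section 5.1).
    E = [np.asarray(e, dtype=float) for e in explanations]
    g_avg = np.mean(E, axis=0)
    endpoints = [greedy_walk(e, g_avg) for e in E] + \
                [greedy_walk(g_avg, e) for e in E]
    return min(endpoints, key=entropy)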
5.2 Region Shrinking Method
In our second approach, we consider the closed region, $R$, which is the convex hull of all the explanation functions $g_i(f, x)$. Notice that region $R$ initially contains $g_{avg}$. We consider an iterative approach to find the global minimum in the region $R$. As before, we consider the convex combination formed by two explanation functions, $g_i$ and $g_j$. Using convex optimization, we find the value on the line segment between $g_i$ and $g_j$ that has the minimum complexity; essentially, we iteratively shrink the region. For the region shrinking method, the convex combination formed by $g_i$ and $g_j$ is:
$$w\, g_i + (1 - w)\, g_j, \quad w \in [0, 1]$$
For every pair of functions in $\mathcal{G}_m$, we find the point that produces the minimum complexity in the convex combination of the functions, producing a new set of candidates $\mathcal{G}'_m$. $g_{agg}$ is the element in the set $\mathcal{G}'_m$ with minimal complexity after $K$ iterations. In each iteration, a point is chosen if it has the minimum complexity of all the points in a convex combination. Thus, the minimum complexity of the set $\mathcal{G}'_m$ decreases or remains constant with each iteration.
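A minimal Python sketch of region shrinking follows. A grid scan over $w$ stands in for the convex optimization along each segment, the entropy helper is redefined for self-containment, and all names and iteration counts are illustrative.

import numpy as np
from itertools import combinations

def entropy(phi, eps=1e-12):
    p = np.abs(phi) / (np.abs(phi).sum() + eps)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def min_entropy_on_segment(gi, gj, n_grid=101):
    # Scan w * gi + (1 - w) * gj over w in [0, 1] and return the lowest-entropy
    # point; a grid scan stands in for the convex optimization step.
    candidates = [w * gi + (1 - w) * gj for w in np.linspace(0.0, 1.0, n_grid)]
    return min(candidates, key=entropy)

def region_shrinking_aggregate(explanations, n_iters=3, keep=None):
    # Repeatedly replace the candidate set with the minimum-entropy points
    # found on every pairwise segment, then return the best survivor.
    S = [np.asarray(e, dtype=float) for e in explanations]
    keep = keep or len(S)
    for _ in range(n_iters):
        if len(S) < 2:
            break
        S = [min_entropy_on_segment(gi, gj) for gi, gj in combinations(S, 2)]
        S = sorted(S, key=entropy)[:keep]   # keep the N lowest-entropy points
    return min(S, key=entropy)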
6 Lowering Sensitivity Via Aggregation
To construct an aggregate explanation function $g$ that minimizes sensitivity, we would need to ensure that a test point's explanation is a function of the explanations of its nearest neighbors under $\rho$. This is a natural analog for how humans reason: we use past similar events (training data) and facts about the present (individual features) to make decisions [Bhatt et al., 2019b]. We now contribute a new explanation function $g_{AVA}$ that combines the Shapley value explanations of a test point's nearest neighbors to explain the test point.

Figure 1: Visual examples of the two complexity-lowering aggregation algorithms: the gradient-descent style (a) and region shrinking (b) methods, using explanation functions $g_1$, $g_2$, $g_3$.
6.1 Shapley Value Review
Borrowing from game theory, Shapley values denote the marginal contributions of a player to the payoff of a coalitional game. Let $T$ be the set of players and let $v : 2^T \to \mathbb{R}$ be the characteristic function, where $v(S)$ denotes the worth (contribution) of the players in $S \subseteq T$. The Shapley value of player $i$'s contribution (averaging player $i$'s marginal contributions to all possible subsets $S$) is:
$$\phi_i(v) = \frac{1}{|T|} \sum_{S \subseteq T \setminus \{i\}} \binom{|T| - 1}{|S|}^{-1} \big(v(S \cup \{i\}) - v(S)\big)$$
Let $\Phi \in \mathbb{R}^{|T|}$ be a Shapley value contribution vector for all players in the game, where $\phi_i(v)$ is the $i$th element of $\Phi$.
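For small games, the formula above can be evaluated exactly, as in the following sketch; the characteristic function `v` is assumed to take a Python set of player indices and return a scalar, and all names are illustrative. The weights are written as $1/\big(|T|\binom{|T|-1}{|S|}\big)$, matching the expression above; the cost is exponential in the number of players, which is why the approximations discussed in Section 6.2 are used for feature attribution in practice.

import numpy as np
from itertools import combinations
from math import comb

def exact_shapley(v, n_players):
    # Exact Shapley values for characteristic function v over players 0..n-1.
    phi = np.zeros(n_players)
    players = list(range(n_players))
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = 1.0 / (n_players * comb(n_players - 1, size))
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi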
6.2 Shapley Values as Explanations
In the feature importance literature, we formulate a similar problem where the game's payoff is the predictor's output $y = f(x)$, the players are the $d$ features of $x$, and the $\phi_i$ values represent the contribution of $x_i$ to the game $f(x)$. Let the characteristic function be the importance score of a subset of features $x_s$, where $\mathbb{E}_Y[\cdot \mid x]$ is an expectation over $P_f(\cdot \mid x)$:
$$v_x(S) = -\mathbb{E}_Y\left[\log \frac{1}{P_f(Y \mid x_s)}\ \Big|\ x\right]$$
This characteristic function denotes the negative of the expected number of bits required to encode the predictor's output based on the features in a subset $S$ [Chen et al., 2019]. Shapley value contributions can be approximated via Monte Carlo sampling [Štrumbelj and Kononenko, 2014] or via weighted least squares [Lundberg and Lee, 2017].
6.3 Aggregate Valuation of Antecedents
We now explore how to explain a test point in terms of the Shapley value explanations of its neighbors. Termed Aggregate Valuation of Antecedents (AVA), we derive an explanation function that explains a data point in terms of the explanations of its neighbors. We do the following: suppose we want to find an explanation function $g_{AVA}(f, x_{test})$ for a point of interest $x_{test}$. First we find the $k$ nearest neighbors of $x_{test}$ under $\rho$, denoted by $N_k(x_{test}, \mathcal{D})$:
$$N_k(x_{test}, \mathcal{D}) = \arg\min_{\mathcal{N} \subset \mathcal{D}, |\mathcal{N}| = k} \sum_{z \in \mathcal{N}} \rho(x_{test}, z)$$
We define $g_{AVA}(f, x_{test}) = \Phi_{x_{test}}$ as the explanation function where:
$$g_{AVA}(f, x_{test})_i = \phi_i(v_{AVA}) = \sum_{z \in N_k(x_{test})} \frac{g_{SHAP}(f, z)_i}{\rho(x_{test}, z)} = \sum_{z \in N_k(x_{test})} \frac{\phi_i(v_z)}{\rho(x_{test}, z)}$$

In essence, we weight each neighbor's Shapley value contribution by the inverse distance from the neighbor to the test point. AVA is closely related to bootstrap aggregation from classical statistics, as we take an average of model outputs to improve explanation function stability.
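A minimal Python sketch of $g_{AVA}$ is below: find the $k$ nearest training points under $\rho$, weight each neighbor's Shapley explanation by inverse distance, and optionally normalize the weights to sum to 1, as in the convex-combination formulation discussed after Theorem 5. The function `g_shap`, the default metric, and all other names are illustrative stand-ins.

import numpy as np

def ava_explanation(f, x_test, train_points, g_shap, rho=None, k=5, normalize=True):
    # AVA: explain x_test by a distance-weighted combination of the Shapley
    # value explanations of its k nearest training neighbors.
    rho = rho or (lambda a, b: np.max(np.abs(a - b)))
    dists = np.array([rho(x_test, z) for z in train_points])
    nearest = np.argsort(dists)[:k]                    # N_k(x_test, D)
    weights = 1.0 / np.maximum(dists[nearest], 1e-12)  # inverse-distance weights
    if normalize:
        weights = weights / weights.sum()              # convex-combination variant
    return sum(w * g_shap(f, train_points[i]) for w, i in zip(weights, nearest))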
Theorem 5. $g_{AVA}(f, x_{test})$ is a Shapley value explanation.

Proof. We want to show that $g_{AVA}(f, x_{test}) = \Phi_{x_{test}}$ is indeed a vector of Shapley values. Let $g_{SHAP}(f, z) = \Phi_z$ be the vector of Shapley value contributions for a point $z \in N_k$. By [Lundberg and Lee, 2017], we know $g_{SHAP}(f, z)_i = \phi_i(v_z)$ is a unique Shapley value for the characteristic function $v_z$. By linearity of Shapley values [Shapley, 1953], we know that:
$$\phi_i(v_{z_1} + v_{z_2}) = \phi_i(v_{z_1}) + \phi_i(v_{z_2}) \qquad (3)$$
This means that $\Phi_{z_1} + \Phi_{z_2}$ will yield a unique Shapley value contribution vector for the characteristic function $v_{z_1} + v_{z_2}$. By linearity (or additivity), we know for any scalar $\alpha$:
$$\alpha\, \phi_i(v_z) = \phi_i(\alpha\, v_z) \qquad (4)$$
This means $\alpha\, \Phi_z$ will yield a unique Shapley value contribution vector for the characteristic function $\alpha\, v_z$. Now define:
$$\Phi_{x_{test}} = \sum_{z \in N_k(x_{test})} \frac{\Phi_z}{\rho(x_{test}, z)}$$
We can conclude that $\Phi_{x_{test}}$ is a vector of Shapley values.
While [Sundararajan et al., 2017] takes a path integral from a fixed reference baseline $\bar{x}$ and [Lundberg and Lee, 2017] only considers attribution along the straight line path between $\bar{x}$ and $x_{test}$, AVA takes a weighted average of attributions along paths from training points in $N_k$ to $x_{test}$. AVA can similarly be thought of as a convex combination of explanation functions where the explanation functions are the explanations of the neighbors of $x_{test}$ and the weights are $\rho(x_{test}, z)^{-1}$. Though the weights are guaranteed to be non-negative, we normalize the weights to sum to 1 and edit the AVA formulation to be: $g_{AVA}(f, x_{test}) = \rho_{tot}\, \Phi_{x_{test}}$, where $\rho_{tot} = \big(\sum_{z \in N_k(x_{test})} \rho(x_{test}, z)^{-1}\big)^{-1}$. Notice this formulation is a specific convex combination as described before; therefore, AVA will result in a lower sensitivity than $g_{SHAP}(f, x)$ alone.
6.4 Medical Connection
Similar to how a model uses input features to reach an output, medical professionals learn how to proactively search for risk predictors in a patient. Medical professionals not only use patient attributes (e.g., vital signs, personal information) to make a diagnosis but also leverage experiences with past patients; for example, if a doctor treated a rare disease over a decade ago, then that experience can be crucial when attributes alone are uninformative about how to diagnose [Goold and Lipkin Jr, 1999]. This is analogous to "close" training points affecting a predictor's output. AVA combines the attributions of past training points (past patients) to explain an unseen test point (current patient). When using the MIMIC dataset [Johnson et al., 2016], AVA models the aforementioned intuition.
7 Experiments
We now report some empirical results. We evaluate models trained on the following datasets: Adult, Iris [Dua and Graff, 2017], MIMIC [Johnson et al., 2016], and MNIST [LeCun et al., 1998]. We use the following explanation functions: SHAP [Lundberg and Lee, 2017], Shapley Sampling (SS) [Štrumbelj and Kononenko, 2014], Gradient Saliency (Grad) [Baehrens et al., 2010], Grad*Input (G*I) [Shrikumar et al., 2017], Integrated Gradients (IG) [Sundararajan et al., 2017], and DeepLift (DL) [Shrikumar et al., 2017].
For all tabular datasets, we train a multilayer perceptron (MLP) with leaky-ReLU activation using the ADAM optimizer. For Iris [Dua and Graff, 2017], we train our model to 96% test accuracy. For Adult [Dua and Graff, 2017], our model has 82% test accuracy. As motivated in Section 6.4, we use MIMIC (Medical Information Mart for Intensive Care III) [Johnson et al., 2016]. We extract seventeen real-valued features deemed critical, per [Purushotham et al., 2018], for sepsis prediction. Our model gets 91% test accuracy on the task. For MNIST [LeCun et al., 1998], our model is a convolutional neural network and has 90% test accuracy.
For experiments with a baseline $\bar{x}$, a zero baseline implies that we set features to 0, and an average baseline uses the average feature value in $\mathcal{D}$. Before doing aggregation, we unit norm all explanations. For the complexity criterion, we take the positive $\ell_1$ norm. We set $D = \ell_2$ and $\rho = \ell_\infty$.
7.1 Faithfulness $\mu_F$
In Table 2, we report results for faithfulness for various explanation functions. When evaluating, we take the average of multiple runs where, in each run, we see at least 50 datapoints; for each datapoint, we randomly select $|S|$ features and replace them with baseline values. We then calculate the Pearson correlation coefficient between the predicted logits of each modified test point and the average explanation attribution for only the subset of features. We notice that, as subset size increases, faithfulness increases until the subset is large enough to contain all informative features. We find that Shapley values, approximated with weighted least squares, are the most faithful explanation function for smaller datasets.
7.2 Max and Avg Sensitivity $\mu_M$ and $\mu_A$
In Table 3, we report the max and average sensitivities for various explanation functions. To evaluate the sensitivity criterion, we sample a set of test points from $\mathcal{D}$ and an additional larger set of training points. We then find the training points that fall within a radius $r$ neighborhood of each test point and find the distance between each nearby training point explanation and the test point explanation to get a mean and max. We average over ten random runs of this procedure.
Table 1: Qualitative example of aggregation to lower complexity ($\mu_C$) on an MNIST input (image panels omitted). Best (DeepLift): $\mu_C = 3.688$; Convex: $\mu_C = 3.685$; Gradient-Descent: $\mu_C = 3.575$; Region-Shrinking: $\mu_C = 3.208$. We show that it is possible to lower complexity slightly with both of our approaches; note that achieving the lowest complexity on an image would imply that all attribution is placed on a single pixel.
METHOD   ADULT (|S|=2)   IRIS (|S|=2)   MIMIC (|S|=10)   MIMIC (|S|=20)
SHAP     (62, 60)        (67, 68)       (31, 36)         (37, 47)
SS       (46, 27)        (32, 36)       (59, 58)         (38, 45)
GRAD     (30, 53)        (14, 16)       (37, 41)         (28, 63)
G*I      (38, 39)        (27, 30)       (54, 48)         (59, 43)
IG       (47, 33)        (60, 57)       (66, 51)         (68, 51)
DL       (58, 43)        (46, 48)       (84, 54)         (43, 45)

Table 2: Faithfulness $\mu_F$ averaged over a test set, reported as (Zero Baseline, Training Average Baseline). Exact quantities can be obtained by dividing table entries by $10^2$.
METHOD   ADULT (r=2)   IRIS (r=0.2)   MIMIC (r=4)
SHAP     (60, 54)      (310, 287)     (6, 5)
SS       (191, 168)    (477, 345)     (83, 81)
GRAD     (60, 50)      (68, 66)       (28, 28)
G*I      (86, 71)      (298, 279)     (77, 50)
IG       (19, 17)      (495, 462)     (19, 15)
DL       (74, 74)      (850, 820)     (135, 111)

Table 3: Sensitivity, reported as (Max $\mu_M$, Avg $\mu_A$). Exact quantities can be obtained by dividing table entries by $10^3$.
Sensitivity is highly dependent on the dimensionality $d$ and on the radius $r$. We find that sensitivity decreases as $r$ increases. Empirically, for MIMIC, Shapley values approximated by weighted least squares (SHAP) are the least sensitive.
7.3 MNIST Complexity $\mu_C$
In Table 1, we provide a qualitative example of the gradient-descent style and region-shrinking methods for lowering the complexity of explanations from a model trained on MNIST. We show an example with images since it illustrates the notion of lower complexity well; however, other data types (tabular) might be better suited for complexity optimization.
7.4 AVA
Our empirical findings support the use of an AVA explanation if low sensitivity is desired. [Ghorbani et al., 2019] note that perturbation-based explanations (like $g_{SHAP}$) are less sensitive than their gradient-based counterparts. In Table 4, we show that AVA explanations not only have lower sensitivities in all experiments but also are less complex (depending on the radius $r$ and the number of features $d$).
METHOD                 ADULT         IRIS          MIMIC
$\mu_A(f, g_{SHAP})$   0.16 ± 0.11   0.22 ± 0.25   0.47 ± 0.12
$\mu_A(f, g_{AVA})$    0.07 ± 0.07   0.13 ± 0.18   0.31 ± 0.13
$\mu_M(f, g_{SHAP})$   0.68 ± 0.13   1.20 ± 0.36   0.83 ± 0.17
$\mu_M(f, g_{AVA})$    0.52 ± 0.11   1.18 ± 0.28   0.72 ± 0.22
$\mu_C(f, g_{SHAP})$   1.94 ± 0.26   1.36 ± 0.36   2.33 ± 0.23
$\mu_C(f, g_{AVA})$    1.93 ± 0.24   1.24 ± 0.32   2.61 ± 0.29

Table 4: AVA lowers the sensitivity of Shapley value explanations across all datasets. When $d$ is small (fewer features), AVA explanations are slightly less complex.
After finding the average distance between pairs of points, we use $r = 1$ for Adult, $r = 0.3$ for Iris, and $r = 10$ for MIMIC.
8 Conclusion
Borrowing from earlier work in social science and the philosophy of science, we codify low sensitivity, high faithfulness, and low complexity as three desirable properties of explanation functions. We define these three properties for feature-based explanation functions, develop an aggregation scheme for learning combinations of various explanation functions, and devise schemes to learn explanations with lower complexity (iterative approaches) and lower sensitivity (AVA). We hope that this work will provide practitioners with a principled way to evaluate feature-based explanations and to learn an explanation which aggregates and optimizes for criteria desired by end users. Though we consider one criterion at a time, future work could further axiomatize our criteria, explore the interaction between different evaluation criteria, and devise a multi-objective optimization approach to finding a desirable explanation; for example, can we develop a procedure for learning a less sensitive and less complex explanation function simultaneously?
Acknowledgements
We thank reviewers for their feedback. We thank Pradeep Ravikumar, John Shi, Brian Davis, Kathleen Ruan, Javier Antorán, James Allingham, and Adithya Raghuraman for their comments and help. UB acknowledges support from DeepMind and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI) and from the Partnership on AI. AW acknowledges support from the David MacKay Newton Research Fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the CFI.
References
[Adebayo et al., 2018]Julius Adebayo, Justin Gilmer,
Michael Muelly, Ian Goodfellow, Moritz Hardt, and
Been Kim. Sanity checks for saliency maps. In Ad-
vances in Neural Information Processing Systems, pages
9505–9515, 2018.
[Ancona et al., 2018] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
[Baehrens et al., 2010] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. JMLR, 11(Jun):1803–1831, 2010.
[Batterman and Rice, 2014]Robert W Batterman and
Collin C Rice. Minimal model explanations. Philosophy
of Science, 81(3):349–376, 2014.
[Berger, 2013]James O Berger. Statistical decision theory
and Bayesian analysis. Springer Science & Business Me-
dia, 2013.
[Bhatt et al., 2019a]Umang Bhatt, Pradeep Ravikumar,
et al. Building human-machine trust via interpretability.
In Proceedings of the AAAI Conference on Artificial Intel-
ligence, volume 33, pages 9919–9920, 2019.
[Bhatt et al., 2019b] Umang Bhatt, Pradeep Ravikumar, and José M. F. Moura. Towards aggregating weighted feature attributions. arXiv:1901.10040, 2019.
[Bhatt et al., 2020] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. Explainable machine learning in deployment. ACM Conference on Fairness, Accountability, and Transparency (FAT*), 2020.
[Bylinskii et al., 2018] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):740–757, 2018.
[Camburu et al., 2019]Oana-Maria Camburu, Eleonora
Giunchiglia, Jakob Foerster, Thomas Lukasiewicz, and
Phil Blunsom. Can I trust the explainer? Verifying
post-hoc explanatory methods. arXiv:1910.02065, 2019.
[Carter et al., 2019]Brandon Carter, Jonas Mueller, Sid-
dhartha Jain, and David Gifford. What made you do this?
understanding black-box decisions with sufficient input
subsets. In The 22nd International Conference on Arti-
ficial Intelligence and Statistics, pages 567–576, 2019.
[Chang et al., 2019]Chun-Hao Chang, Elliot Creager, Anna
Goldenberg, and David Duvenaud. Explaining image clas-
sifiers by counterfactual generation. In International Con-
ference on Learning Representations, 2019.
[Chen et al., 2018] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 883–892, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[Chen et al., 2019] Jianbo Chen, Le Song, Martin J Wainwright, and Michael I Jordan. L-Shapley and C-Shapley: Efficient model interpretation for structured data. International Conference on Learning Representations, 2019.
[Davis et al., 2020]B. Davis, U. Bhatt, K. Bhardwaj,
R. Marculescu, and J. M. F. Moura. On network sci-
ence and mutual information for explaining deep neural
networks. In ICASSP 2020 - 2020 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 8399–8403, 2020.
[Dua and Graff, 2017]Dheeru Dua and Casey Graff. UCI
machine learning repository, 2017.
[Ghorbani et al., 2019]Amirata Ghorbani, Abubakar Abid,
and James Zou. Interpretation of neural networks is frag-
ile. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 3681–3688, 2019.
[Gilpin et al., 2018]Leilani H Gilpin, David Bau, Ben Z
Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal.
Explaining explanations: An overview of interpretability
of machine learning. In 2018 IEEE 5th International Con-
ference on data science and advanced analytics (DSAA),
pages 80–89. IEEE, 2018.
[Goold and Lipkin Jr, 1999]Susan Dorr Goold and Mack
Lipkin Jr. The doctor–patient relationship: challenges,
opportunities, and strategies. Journal of general internal
medicine, 14(Suppl 1):S26, 1999.
[Grabska-Barwińska, 2020] Agnieszka Grabska-Barwińska. Measuring and improving the quality of visual explanations. arXiv preprint arXiv:2003.08774, 2020.
[Hazard et al., 2019]Christopher J Hazard, Christopher
Fusting, Michael Resnick, Michael Auerbach, Michael
Meehan, and Valeri Korobov. Natively interpretable
machine learning and artificial intelligence: Prelim-
inary results and future directions. arXiv preprint
arXiv:1901.00246, 2019.
[Hind et al., 2019] Michael Hind, Dennis Wei, Murray Campbell, Noel CF Codella, Amit Dhurandhar, Aleksandra Mojsilović, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. TED: Teaching AI to explain its decisions. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129, 2019.
[Honegger, 2018]Milo Honegger. Shedding Light on Black
Box Algorithms. Master’s thesis, Karlsruhe Institute of
Technology, Germany, 2018.
[Hooker et al., 2019]Sara Hooker, Dumitru Erhan, Pieter-
Jan Kindermans, and Been Kim. A benchmark for inter-
pretability methods in deep neural networks. In Advances
in Neural Information Processing Systems, pages 9734–
9745, 2019.
[Johnson et al., 2016] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016.
[Kindermans et al., 2019] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 267–280. Springer, 2019.
[Lage et al., 2019]Isaac Lage, Emily Chen, Jeffrey He,
Menaka Narayanan, Been Kim, Samuel J Gershman, and
Finale Doshi-Velez. Human evaluation of models built for
interpretability. In Proceedings of the AAAI Conference
on Human Computation and Crowdsourcing, volume 7,
pages 59–67, 2019.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Lipton, 2003]Peter Lipton. Inference to the best explana-
tion. Routledge, 2003.
[Lundberg and Lee, 2017]Scott M Lundberg and Su-In Lee.
A unified approach to interpreting model predictions. In
Advances in Neural Information Processing Systems 30
(NeurIPS 2017), pages 4765–4774, 2017.
[Melis and Jaakkola, 2018]David Alvarez Melis and Tommi
Jaakkola. Towards robust interpretability with self-
explaining neural networks. In Advances in Neural Infor-
mation Processing Systems (NeurIPS 2018), 2018.
[Montavon et al., 2018] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
[Osman et al., 2020]Ahmed Osman, Leila Arras, and Woj-
ciech Samek. Towards ground truth evaluation of visual
explanations. arXiv preprint arXiv:2003.07258, 2020.
[Plumb et al., 2018]Gregory Plumb, Denali Molitor, and
Ameet S Talwalkar. Model agnostic supervised local ex-
planations. In Advances in Neural Information Processing
Systems, pages 2515–2524, 2018.
[Poursabzi-Sangdeh et al., 2018]Forough Poursabzi-
Sangdeh, Daniel G Goldstein, Jake M Hofman, Jen-
nifer Wortman Vaughan, and Hanna Wallach. Manipulat-
ing and measuring model interpretability. arXiv preprint
arXiv:1802.07810, 2018.
[Purushotham et al., 2018]Sanjay Purushotham, Chuizheng
Meng, Zhengping Che, and Yan Liu. Benchmarking deep
learning models on large healthcare datasets. Journal of
Biomedical Informatics, 83:112–134, 2018.
[Ribeiro et al., 2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
[Rieger and Hansen, 2020] Laura Rieger and Lars Kai Hansen. IROF: a low resource evaluation metric for explanation methods. arXiv preprint arXiv:2003.08747, 2020.
[Ruben, 2015]David-Hillel Ruben. Explaining explanation.
Routledge, 2015.
[Samek et al., 2016] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2016.
[Shapley, 1953]Lloyd S Shapley. A value for n-person
games. In Contributions to the Theory of Games II, pages
307–317, 1953.
[Shrikumar et al., 2017]Avanti Shrikumar, Peyton Green-
side, and Anshul Kundaje. Learning important features
through propagating activation differences. In Proceed-
ings of the 34th International Conference on Machine
Learning-Volume 70 (ICML 2017), pages 3145–3153.
Journal of Machine Learning Research, 2017.
[Štrumbelj and Kononenko, 2014] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.
[Sundararajan et al., 2017]Mukund Sundararajan, Ankur
Taly, and Qiqi Yan. Axiomatic attribution for deep net-
works. In Proceedings of the 34th International Confer-
ence on Machine Learning-Volume 70 (ICML 2017), pages
3319–3328. Journal of Machine Learning Research, 2017.
[Wang et al., 2020]Zifan Wang, Piotr Mardziel, Anupam
Datta, and Matt Fredrikson. Interpreting interpretations:
Organizing attribution methods by criteria. arXiv preprint
arXiv:2002.07985, 2020.
[Warnecke et al., 2019]Alexander Warnecke, Daniel Arp,
Christian Wressnegger, and Konrad Rieck. Evaluating ex-
planation methods for deep learning in security. arXiv
preprint arXiv:1906.02108, 2019.
[Yang and Kim, 2019]Mengjiao Yang and Been Kim. BIM:
Towards quantitative evaluation of interpretability meth-
ods with ground truth. arXiv:1907.09701, 2019.
[Yang et al., 2019]Fan Yang, Mengnan Du, and Xia
Hu. Evaluating explanation without ground truth
in interpretable machine learning. arXiv preprint
arXiv:1907.06831, 2019.
[Yeh et al., 2019] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and Pradeep K Ravikumar. On the (in)fidelity and sensitivity of explanations. In Advances in Neural Information Processing Systems, pages 10965–10976, 2019.
[Zhang et al., 2019]Hao Zhang, Jiayi Chen, Haotian Xue,
and Quanshi Zhang. Towards a unified evaluation of ex-
planation methods without ground truth. arXiv preprint
arXiv:1911.09017, 2019.
A Additional Evaluation Criteria
In addition to the aforementioned three criteria, there are many other desirable criteria for a $g$. To assist practitioners, we now collect and list these additional quantitative evaluation criteria for feature-level explanations. It is possible to evaluate all criteria for both perturbation-based explanations [Štrumbelj and Kononenko, 2014; Lundberg and Lee, 2017] and gradient-based explanations [Sundararajan et al., 2017; Shrikumar et al., 2017]. Note we omit evaluation criteria that assume access to ground-truth explanations for training points; for a thorough treatment of this topic, see [Hind et al., 2019; Osman et al., 2020]. We do not delve into human-centered evaluation of explanation functions either; see [Gilpin et al., 2018; Poursabzi-Sangdeh et al., 2018; Yang et al., 2019] for detailed discussions.
Predictability of Explanations

We would want to ensure that explanations from $g$ are predictable. As such, $g(f, x)$ ought not vary over function calls. [Honegger, 2018] notes that identical inputs should give identical explanations.

Definition 5 (Identity). Given a predictor $f$, an explanation function $g$, and distance metrics $D$ and $\rho$, we define the identity criterion for $g$ on $\mathcal{D}$ as:
$$\mu_{IDENTITY}(f, g) = \mathbb{E}_{x \in \mathcal{D}_x}\big[D(g(f, x), g(f, x))\big] = \mathbb{E}_{x \sim \mathcal{D}_x}\big[\|g(f, x) - g(f, x)\|_0\big]$$
Note the above two are equivalent, and we take the $\ell_0$ norm of the difference between two separate calls to $g$ with the same input $x$. The identity criterion favors non-stochastic explanation functions. We would also want to ensure that any non-identical inputs have non-identical explanations.
Definition 6 (Separability). Given a predictor $f$, an explanation function $g$, and distance metrics $D$ and $\rho$, we define the separability of $g$ on $\mathcal{D}$ as:
$$\mu_{SEP}(f, g) = \mathbb{E}_{x, z \in \mathcal{D}_x,\, x \neq z}\big[D(g(f, x), g(f, z))\big] = \mathbb{E}_{x, z \in \mathcal{D}_x,\, x \neq z}\big[\|g(f, x) - g(f, z)\|_0\big]$$
We would also want to know how surprising an explanation $g(f, x)$ is compared to explanations for training data. [Hazard et al., 2019] defines the conviction of an input $x$ with respect to $\mathcal{D}_x$ for $k$-Nearest Neighbor algorithms; similarly, we define the conviction of $g(f, x)$ relative to the explanations of training points, $\mathcal{D}_x$, using $g$.

Definition 7 (Conviction). Given a predictor $f$, an explanation function $g$, a probability distribution over explanations $\mathbb{P}_\phi(\cdot)$, and a data point $x$, we define the conviction of $g$ at $x$ for $\mathcal{D}$ as:
$$\mu_{CON}(f, g, \mathbb{P}_\phi; x) = \frac{\mathbb{E}_{z \sim \mathcal{D}_x}\big[I(g(f, z))\big]}{I(g(f, x))}$$
where $I(g(f, x)) = -\ln\big(\mathbb{P}_\phi(g(f, x))\big)$.

$\mu_{CON} = 0$ means that $g(f, x)$ is surprising. As $\mu_{CON} \to \infty$, $g(x)$ contains an expected amount of surprisal and can reasonably occur. We desire a higher $\mu_{CON}$, implying that $g$ behaves predictably. By changing the distribution to $\mathbb{P}_\phi(\cdot \mid y = f(x))$, the numerator to a conditional entropy where $f(z) = f(x)$, and the self-information to $I(g(f, x)) = -\ln\big(\mathbb{P}_\phi(g(f, x) \mid y = f(x))\big)$, we define the conditional conviction of $g(f, x)$ relative to explanations of the same predicted class.
Other techniques have also argued that $g(f, x)$ should recover the output of the original predictor, $f(x)$. Deemed compatibility, this criterion attempts to use $g$ as a simple proxy for reproducing the outputs of the complex $f$.

Definition 8 (Compatibility). Given a predictor $f$ and an explanation function $g$, we define the compatibility of $g$ for a dataset $\mathcal{D}$ as:
$$\mu_{COM}(f, g) = \frac{1}{N} \sum_{x \in \mathcal{D}_x} \left[\left(\sum_{i=1}^{d} g(f, x)_i\right) - f(x)\right]$$

The closer $\mu_{COM}$ is to 0, the more compatible the explanation function is; that is, the explanation function recovers the complex model's outputs well. This criterion is related to the completeness axiom of some explanation functions [Sundararajan et al., 2017]. An explanation function can be built to be compatible with the original model (or complete with respect to $f$). This is also related to the notion of post-hoc accuracy discussed in [Chen et al., 2018].
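A minimal Python sketch of Definition 8 follows; the names are illustrative and $f$ is assumed to return a scalar output. Depending on convention, one might instead average the absolute gaps rather than the signed gaps.

import numpy as np

def compatibility(f, g, data):
    # mu_COM: average gap between the summed attributions and the model output
    # over a dataset; closer to 0 means g better recovers the predictor's outputs.
    gaps = [np.sum(g(f, x)) - f(x) for x in data]
    return float(np.mean(gaps))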
Importance of Explanations

Not only do we want to ensure that $g$ faithfully identifies the most important features, but we also want to understand how well $f$ performs when $x_s$ is unobserved (or set to a baseline $x_s = \bar{x}_s$). In particular, we craft $S^*$ to contain the indices of the $|S|$ features with the highest $|g(f, x)_i|$:
$$S^* = \arg\max_{S \subseteq [d],\, |S| = k} \sum_{i \in S} |g(f, x)_i|$$
Therefore, $x_s$ is now a sub-vector of the most important features according to a specific $g$. As done in [Chang et al., 2019], we define a score $s_f$ for how confidently $f$ predicts an output $y$ in terms of log-odds:
$$s_f(y \mid x) = \log\big(\hat{P}_f(y \mid x)\big) - \log\big(1 - \hat{P}_f(y \mid x)\big)$$

Definition 9 (Deletion). Given a predictor $f$, an explanation function $g$, a point of interest $x$, a predicted output $y$, and a subset of important features $S^*$, we define the deletion score for $f$ at $x$ as:
$$\mu_{DEL}(f, g; x, y) = s_f(y \mid x) - s_f\big(y \mid x_{[x_s = \bar{x}_s]}\big)$$

Definition 10 (Addition). Given a predictor $f$, an explanation function $g$, a point of interest $x$, a predicted output $y$, and a subset of important features $S^*$, we define the addition score for $f$ at $x$ as:
$$\mu_{ADD}(f, g; x, y) = s_f\big(y \mid x_{[x_s = \bar{x}_s]}\big) - s_f(y \mid \bar{x})$$
While the deletion score conveys how the log-odds change when we delete the subset of important features from $x$, the addition score tells us how much the log-odds change when we add the subset to the baseline. Instead of re-scoring (via the change in log-odds) a modified input like $x_{[x_s = \bar{x}_s]}$, we can retrain the predictor $f$ on a dataset of modified inputs $\mathcal{D}_{x_{[x_s = \bar{x}_s]}}$. Addition and deletion are closely related to explanation selectivity, described in [Montavon et al., 2018].
Let $f_{\bar{x}_s}$ denote the predictor trained on the modified inputs with the most important pixels removed. As in [Hooker et al., 2019], we define the ROAR score as the difference in accuracy between the original predictor and the modified predictor. We can also train a predictor where the least important features (those in $x_c$) are removed. We denote that predictor $f_{\bar{x}_c}$ and define a KAR score, as proposed in [Hooker et al., 2019].

Definition 11 (ROAR). Given a predictor $f$, an explanation function $g$, a modified predictor $f_{\bar{x}_s}$, and a subset of important features $S^*$, we define the ROAR score for $g$ on a dataset $\mathcal{D}$ as:
$$\mu_{ROAR}(f, g, f_{\bar{x}_s}) = \frac{1}{N} \sum_{x \in \mathcal{D}_x} \mathbb{1}[f(x) = y] - \mathbb{1}[f_{\bar{x}_s}(x) = y]$$

Definition 12 (KAR). Given a predictor $f$, an explanation function $g$, a modified predictor $f_{\bar{x}_c}$, and a subset of important features $S^*$, we define the KAR score for $g$ on a dataset $\mathcal{D}$ as:
$$\mu_{KAR}(f, g, f_{\bar{x}_c}) = \frac{1}{N} \sum_{x \in \mathcal{D}_x} \mathbb{1}[f(x) = y] - \mathbb{1}[f_{\bar{x}_c}(x) = y]$$
Other Connections

We can also draw parallels between the three criteria proposed in the main paper and existing criteria in the literature. Low sensitivity is discussed as stability in [Melis and Jaakkola, 2018], as explanation continuity in [Montavon et al., 2018], as sensitivity in [Yeh et al., 2019], and as reliability in [Kindermans et al., 2019]. High faithfulness appears as relevance in [Samek et al., 2016], as gold set in [Ribeiro et al., 2016], as faithfulness in [Plumb et al., 2018], as sensitivity-n in [Ancona et al., 2018], and as infidelity in [Yeh et al., 2019]. Low complexity is loosely related to information gain from [Bylinskii et al., 2018] and to descriptive sparsity from [Warnecke et al., 2019].

Moreover, very recent literature has also tried to develop various other explanation function criteria: parameter randomization [Adebayo et al., 2018], clustering-based interpretations [Carter et al., 2019], existence of "unexplainable components" [Zhang et al., 2019], variants of perturbation techniques [Grabska-Barwińska, 2020], variants of mutual information measures [Davis et al., 2020], impact of iterative feature removal [Rieger and Hansen, 2020], and necessity and sufficiency of attributions [Wang et al., 2020].
B Proofs
For thoroughness, we elaborate on the proofs from the main
paper here.
B.1 Proof of Proposition 1
Proof. Assuming we fix the predictor $f$, let $g(x) = g(f, x)$ and let $\int$ denote $\int_{\rho(x, z) \leq r}$ for the rest of this proof.
$$\begin{aligned}
\mu_A(g_{agg}) &= \int D\big(g_{agg}(x), g_{agg}(z)\big)\, \mathbb{P}_x(z)\, dz \\
&= \int \big\|g_{agg}(x) - g_{agg}(z)\big\|_2\, dz \\
&= \int \big\|w\, g_1(x) + (1 - w)\, g_2(x) - w\, g_1(z) - (1 - w)\, g_2(z)\big\|_2\, dz \\
&= \int \big\|w\big(g_1(x) - g_1(z)\big) + (1 - w)\big(g_2(x) - g_2(z)\big)\big\|_2\, dz \\
&\leq \int \big\|w\big(g_1(x) - g_1(z)\big)\big\|_2 + \big\|(1 - w)\big(g_2(x) - g_2(z)\big)\big\|_2\, dz \\
&= \int w\, \big\|g_1(x) - g_1(z)\big\|_2 + (1 - w)\, \big\|g_2(x) - g_2(z)\big\|_2\, dz \\
&= \int w\, D\big(g_1(x), g_1(z)\big) + (1 - w)\, D\big(g_2(x), g_2(z)\big)\, dz \\
&= w \int D\big(g_1(x), g_1(z)\big)\, dz + (1 - w) \int D\big(g_2(x), g_2(z)\big)\, dz \\
&= w\, \mu_A(g_1) + (1 - w)\, \mu_A(g_2)
\end{aligned}$$
B.2 Proof of Proposition 2
Proof. To prove this, we just need to show that the sum of the squared distances is minimized by the mean of a set of explanation vectors:
$$g_{agg}(f, x) = \frac{1}{m} \sum_{i=1}^{m} g_i(f, x)$$
Recall we have a set of candidate explanation functions $\mathcal{G}_m = \{g_1, \dots, g_m\}$. Fix a point of interest $x$. Since $D$ is the $\ell_2$ distance and $p = 2$, we define a loss function as follows:
$$L\big(g_{agg}(f, x)\big) = \sum_{i=1}^{m} \big\|g_{agg}(f, x) - g_i(f, x)\big\|_2^2$$
We then compute the partial derivative with respect to each feature $j$ of our aggregate explanation, $g_{agg}(f, x)_j$, and set it to zero:
$$\frac{\partial L}{\partial g_{agg}(f, x)_j} = 2m\, g_{agg}(f, x)_j - 2 \sum_{i=1}^{m} g_i(f, x)_j = 0 \implies g_{agg}(f, x)_j = \frac{\sum_{i=1}^{m} g_i(f, x)_j}{m}$$
Stacking the $d$ coordinates gives:
$$g_{agg}(f, x) = \begin{bmatrix} \frac{1}{m}\sum_{i=1}^{m} g_i(f, x)_1 \\ \vdots \\ \frac{1}{m}\sum_{i=1}^{m} g_i(f, x)_d \end{bmatrix} = \frac{1}{m} \sum_{i=1}^{m} g_i(f, x)$$
B.3 Proof of Proposition 3
Proof. To prove this, we just need to show that the sum of the absolute distances is minimized by the feature-wise median of a set of explanation vectors:
$$g_{agg}(f, x) = \operatorname{med}\{g_i(f, x)\}$$
Recall we have a set of candidate explanation functions $\mathcal{G}_m = \{g_1, \dots, g_m\}$. Fix a point of interest $x$. Since $D$ is the $\ell_1$ distance and $p = 1$, we define a loss function as follows:
$$L\big(g_{agg}(f, x)\big) = \sum_{i=1}^{m} \big|g_{agg}(f, x) - g_i(f, x)\big|$$
Taking the partial derivative of the above with respect to each feature $j$ of our aggregate explanation $g_{agg}(f, x)_j$ yields:
$$\frac{\partial L}{\partial g_{agg}(f, x)_j} = \sum_{i=1}^{m} \operatorname{sign}\big(g_{agg}(f, x)_j - g_i(f, x)_j\big)$$
Now the above partial derivative only equals zero when the number of positive and negative terms is the same. The median is the only value where the number of positive terms (those greater than the median) and the number of negative terms (those less than the median) are equal. Thus, the median value for each feature $j$ minimizes the sum of absolute deviations loss crafted above, i.e., $g_{agg}(f, x)_j = \operatorname{med}\{g_1(f, x)_j, g_2(f, x)_j, \dots, g_m(f, x)_j\}$.
B.4 Alternative Proof of Theorem 5
Proof. We want to show that $g_{AVA}(f, x_{test}) = \Phi_{x_{test}}$ is indeed a vector of Shapley values. Let $g_{SHAP}(f, z) = \Phi_z$ be the vector of Shapley value contributions for a point $z \in N_k$. By [Lundberg and Lee, 2017], we know that $g_{SHAP}(f, z)_i = \phi_i(v_z)$ is a unique Shapley value for the characteristic function $v_z$. By linearity of Shapley values [Shapley, 1953], we know that:
$$\phi_i(v_{z_1} + v_{z_2}) = \phi_i(v_{z_1}) + \phi_i(v_{z_2}) \qquad (5)$$
This means that $\Phi_{z_1} + \Phi_{z_2}$ will yield a unique Shapley value contribution vector for the characteristic function $v_{z_1} + v_{z_2}$. By linearity (also called additivity), we also know that, for any scalar $\alpha$:
$$\alpha\, \phi_i(v_z) = \phi_i(\alpha\, v_z) \qquad (6)$$
This means that $\alpha\, \Phi_z$ will yield a unique Shapley value contribution vector for the characteristic function $\alpha\, v_z$. Now, to show $\Phi_{x_{test}}$ is a vector of Shapley values, it suffices to show that any $\phi_i(v_{AVA}) \in \Phi_{x_{test}}$ is a Shapley value. As such, we define $v_{AVA}$ to be the characteristic function of $g_{AVA}(f, x)$, where we find the average weighted importance score of the neighbors of $x_{test}$:
$$v_{AVA}(S) = \sum_{z \in N_k(x_{test})} \frac{v_z(S)}{\rho(x_{test}, z)} = \sum_{z \in N_k(x_{test})} \frac{1}{\rho(x_{test}, z)} \left(-\mathbb{E}_Y\left[\log \frac{1}{P_f(Y \mid z_s)}\ \Big|\ z\right]\right) \qquad (7)$$
By Equations (5), (6), and (7), we can see that $\phi_i(v_{AVA})$ is a Shapley value:
$$g_{AVA}(f, x_{test})_i = \phi_i(v_{AVA}) = \sum_{z \in N_k(x_{test})} \frac{g_{SHAP}(f, z)_i}{\rho(x_{test}, z)} = \sum_{z \in N_k(x_{test})} \frac{\phi_i(v_z)}{\rho(x_{test}, z)} \qquad (8)$$
C Details on Lowering Complexity
Given a fixed input $x$ and an explanation function $g_i$, the complexity can be rewritten as:
$$\mu_C(f, g_i; x) = -\sum_{k=1}^{d} \frac{|g_i(f, x)_k|}{\sum_{j \in [d]} |g_i(f, x)_j|} \ln\left(\frac{|g_i(f, x)_k|}{\sum_{j \in [d]} |g_i(f, x)_j|}\right)$$
This will help us determine how a small perturbation of the $k$th component of $g_i(f, x)$ will affect the complexity of $g_i$, which, in turn, will help find a lower complexity explanation. Note $g_i(f, x)_k$ is the $k$th component of $g_i(f, x)$. The partial derivative of $\mu_C(f, g_i; x)$ with respect to the $k$th component of $g_i(f, x)$ is:
$$\frac{\partial \mu_C(f, g_i; x)}{\partial g_i(f, x)_k} = -\big(1 + \ln(a)\big) \frac{\sum_{\substack{l=1 \\ l \neq k}}^{d} |g_i(f, x)_l|}{\Big(\sum_{j \in [d]} |g_i(f, x)_j|\Big)^2} + \sum_{\substack{l=1 \\ l \neq k}}^{d} \big(1 + \ln(b)\big) \frac{|g_i(f, x)_l|}{\Big(\sum_{j \in [d]} |g_i(f, x)_j|\Big)^2}$$
where $a = \frac{|g_i(f, x)_k|}{\sum_{j \in [d]} |g_i(f, x)_j|}$ and $b = \frac{|g_i(f, x)_l|}{\sum_{j \in [d]} |g_i(f, x)_j|}$.
We now provide an additional discussion and comparison of the two algorithms for lowering complexity. We presented two algorithms for finding a $g_{agg}$ with lower complexity: a gradient descent approach (Algorithm 1) and a region shrinking approach (Algorithm 2). Algorithm 1 relies on a greedy choice of selecting one of the $j$ directions to move in. This algorithm works best for regions that are smooth and have decreasing complexity around $g_i$ and $g_{avg}$. Since Algorithm 1 does not backtrack and moves component-wise, it can avoid areas of higher complexity, but it can take a sub-optimal step. For example, consider when $d = 2$. During a walk, Algorithm 1 may start at $g_i$, move in the $x$ direction, but then get stuck as the complexity in the $y$ direction increases. However, had we moved in the $y$ direction first and then in the $x$ direction, we may have found a minimum. The choice of component plagues this approach. On the other hand, Algorithm 2 solves the issue of getting stuck in regions of high complexity present in Algorithm 1. Since Algorithm 2 shrinks the region by choosing points in the convex combination, it can avoid the areas of high complexity. Since Algorithm 2 uses the line segments between the points chosen, it may be difficult to obtain the global minimum, which may not occur on the line segments considered.
Algorithm 1 Gradient-Descent Style Approach to finding $g_{agg}(f, x)$ with lower complexity

Require: $\alpha$; $g_i(f, x)$ for $i = 1, \dots, m$; fixed $x$
  Calculate the complexity of each $g_i(f, x)$:
  for $i = 1, \dots, m$ do
    $E_{g_i(x)} \leftarrow \mu_C(f, g_i; x) = -\sum_{k=1}^{d} \frac{|g_i(f,x)_k|}{\sum_{j \in [d]} |g_i(f,x)_j|} \ln\left(\frac{|g_i(f,x)_k|}{\sum_{j \in [d]} |g_i(f,x)_j|}\right)$
  end for
  $g_{avg}(f, x) \leftarrow \frac{1}{m} \sum_{i=1}^{m} g_i(f, x)$
  for $i = 1, \dots, m$ do
    Move in the direction of $g_{avg}(f, x)$ from $g_i(f, x)$ as long as the complexity decreases:
    $t_i \leftarrow g_i(f, x)$
    while the complexity of $t_i$ is decreasing and $t_i \neq g_{avg}(f, x)$ do
      for $j = 1, \dots, d$ do
        Calculate $\frac{\partial E_{t_i}}{\partial t_{ij}}$
        if the complexity decreases by moving in the $j$ direction towards $g_{avg}(f, x)$ then
          Update $t_{ij}$: $t_{ij} \leftarrow t_{ij} + \alpha \frac{\partial E_{t_i}}{\partial t_{ij}}$
        end if
      end for
    end while
    Move in the direction of $g_i(f, x)$ from $g_{avg}(f, x)$ as long as the complexity decreases:
    $q_i \leftarrow g_{avg}(f, x)$
    while the complexity of $q_i$ is decreasing and $q_i \neq g_i(x)$ do
      for $j = 1, \dots, d$ do
        Calculate $\frac{\partial E_{q_i}}{\partial q_{ij}}$
        if the complexity decreases by moving in the $j$ direction towards $g_i(x)$ then
          Update $q_{ij}$: $q_{ij} \leftarrow q_{ij} + \alpha \frac{\partial E_{q_i}}{\partial q_{ij}}$
        end if
      end for
    end while
    Take the $t_i$, $q_i$ that minimizes the complexity: $b_i \leftarrow \arg\min_{x \in \{q_i, t_i\}} E_x$
  end for
  Take the $b_i$ that minimizes the complexity: $g_{agg}(f, x) \leftarrow \arg\min_{b_i} E_{b_i}$
Algorithm 2 Region Shrinking Approach to finding $g_{agg}(f, x)$ with lower complexity

Require: $g_i(f, x)$ for $i = 1, \dots, m$; fixed $x$
  $t \leftarrow 0$
  Add all the $g_i$ into set $S$: $S \leftarrow \{g_i(f, x),\ i = 1, \dots, m\}$
  repeat
    Initialize $S' \leftarrow \emptyset$
    for every pair of points $P_1, P_2$ in $S$ do
      Find the point $P$ with the minimum entropy in the convex combination of $P_1$ and $P_2$
      Add point $P$ to $S'$
    end for
    Choose the $N$ minimum-entropy points in $S'$ to form $S$
    $t \leftarrow t + 1$
  until $t = K$
  Take the element in set $S$ that minimizes the entropy: $g_{agg}(f, x) \leftarrow \arg\min_{k \in S} E_k$
A combination of the two approaches can be used. First, Algorithm 2 can be used to shrink the region being considered into a set, $S$, of points with low complexity. This can avoid getting stuck in areas of high complexity, as in Algorithm 1. Then, Algorithm 1 can be used to move around the points in set $S$ in order to find the global minimum, which may not occur on the line segments considered in Algorithm 2; it can refine the points in set $S$ to obtain a lower complexity. In sum, we can shrink the region considered into several candidate sets and then refine the points in each set by perturbing and performing greedy walks around them to find a $g_{agg}$ with low complexity.
D Experimental Setup
We provide additional details on the datasets used and their
respective models from our experiments.
• Iris [Dua and Graff, 2017]: The Iris dataset consists of 150 datapoints: 50 per class and 4 features per datapoint. We use a one layer multilayer perceptron trained to 96% accuracy as our $f$.
• Adult [Dua and Graff, 2017]: Each of the 48,842 datapoints has 38 features and falls in one of two classes. Note we label-encode categorical attributes. We use a one layer MLP (40 hidden nodes with leaky-ReLU activation) trained to an accuracy of 82%.
• MIMIC-III [Johnson et al., 2016]: MIMIC-III (Medical Information Mart for Intensive Care III) is a large electronic health record dataset comprised of health-related data of over 40,000 patients who were admitted to the critical care units of Beth Israel Deaconess Medical Center between the years 2001 and 2012. MIMIC-III consists of demographics, vital sign measurements, lab test results, medications, procedures, caregiver notes, imaging reports, and mortality of the ICU patients. Using the MIMIC-III dataset, we extracted seventeen real-valued features deemed critical in the sepsis diagnosis task as per [Purushotham et al., 2018]. These are the processed features we extracted for every sepsis diagnosis (a binary variable indicating the presence of sepsis): Glasgow Coma Scale, Systolic Blood Pressure, Heart Rate, Body Temperature, PaO2/FiO2 ratio, Urine Output, Serum Urea Nitrogen Level, White Blood Cells Count, Serum Bicarbonate Level, Sodium Level, Potassium Level, Bilirubin Level, Age, Acquired Immunodeficiency Syndrome, Hematologic Malignancy, Metastatic Cancer, and Admission Type. We used two layers of 16 hidden nodes each and leaky-ReLU activation to get an accuracy of 91% on the sepsis prediction task.
• MNIST with CNN [LeCun et al., 1998]: We use a CNN trained to 90% accuracy with the following architecture: one convolutional layer with 32 5×5 filters and ReLU activation; a max pooling layer with a 2×2 filter and stride of 2; a convolutional layer with 64 5×5 filters and ReLU activation; a max pooling layer with a 2×2 filter and stride of 2; and a final dense layer with 10 output neurons. We used the MNIST dataset with 60,000 28×28 grayscale images of the 10 digits, along with a test set of 10,000 images.
Note that we fix a dataset-model pairing for all experiments. In practice, when calculating average sensitivity, we use the following formulation:
$$\mu_A(f, g, x) = \frac{1}{|N_r|} \sum_{z \in N_r} \frac{D\big(g(f, x), g(f, z)\big)}{\rho(x, z)}$$
Effectively, we want to ensure that the distance between an explanation of $x$ and an explanation of $z$, a point in the neighborhood of $x$, is proportional to the distance between $x$ and $z$. Some recent work has shown that average sensitivity can be lowered with simple smoothing tricks applied to explanation functions or with adversarial training of the predictor itself.