FrankenSplit: Efficient Neural Feature
Compression with Shallow Variational
Bottleneck Injection for Mobile Edge Computing
Alireza Furtuanpey, Philipp Raith, Schahram Dustdar, Fellow, IEEE
Abstract—The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks
(DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host and must therefore
offload requests, where the high-dimensional data will compete for limited bandwidth. Split Computing (SC) alleviates resource
inefficiency by partitioning DNN layers across devices, but current methods are overly specific and only marginally reduce bandwidth
consumption. This work proposes shifting away from focusing on executing shallow layers of partitioned DNNs. Instead, it advocates
concentrating the local resources on variational compression optimized for machine interpretability. We introduce a novel framework for
resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource
distribution between edge devices and servers. Our method achieves 60% lower bitrate than a state-of-the-art SC method without
decreasing accuracy and is up to 16x faster than offloading with existing codec standards.
Index Terms—Split Computing, Distributed Inference, Edge Computing, Edge Intelligence, Learned Image Compression, Data
Compression, Neural Data Compression, Feature Compression, Knowledge Distillation
1 INTRODUCTION
Deep Learning (DL) has demonstrated that it can solve
real-world problems in challenging areas ranging from
Computer Vision (CV) [1] to Natural Language Processing
(NLP) [2]. Together with the advancements in
mobile edge computing (MEC) [3] and energy-efficient
AI accelerators, visions of intelligent city-scale platforms
for critical applications, such as mobile augmented real-
ity (MAR) [4], disaster warning [5], or facilities manage-
ment [6], seem progressively feasible. Nevertheless, the
accelerating pervasiveness of mobile clients has given rise to
unprecedented growth in Machine-to-Machine (M2M)
communication [7], leading to an insurmountable amount of net-
work traffic. A root cause is the intrinsic limitation of
mobile devices that allows them to realistically host a single
lightweight Deep Neural Network (DNN) in memory at a
time. Local resources cannot meet the demanding require-
ments of applications that rely on multiple highly accurate
DNNs [8], [9]. Hence, clients must frequently offload infer-
ence requests [10].
The downside to offloading is that by constantly stream-
ing high-dimensional visual data, the limited bandwidth
will inevitably lead to network congestion, resulting in
erratic response delays, and it leaves valuable client-side
resources idle.
Split Computing (SC) emerged as an alternative to al-
leviate inefficient resource utilization and to facilitate low-
latency and performance-critical mobile inference. The basic
idea is to partition a DNN to process the shallow layers with
the client and send a processed representation to the remain-
ing deeper layers deployed on a server. The SC paradigm
can potentially draw resources from the entire edge-cloud
compute continuum. However, current SC methods are
only conditionally applicable (e.g., in highly bandwidth-
constrained networks) or tailored toward specific neural
network architectures. Methods that claim to generalize
towards a broader range of architectures do not consider
that mobile clients can typically only load a single model
into memory. Consequently, SC methods are impractical for
applications with complex requirements relying on infer-
ence from multiple models concurrently (e.g., MAR). Mobile
clients reloading weights from their storage into memory
and sending multiple intermediate representations for each
pruned model would incur more overhead than directly
transmitting image data with fast lossless codecs. Moreover,
due to the conditional applicability of SC, practical meth-
ods rely on a decision mechanism that periodically probes
external conditions (e.g., available bandwidth), resulting in
further deployment and runtime complexity [11].
This work shows that we can address the increasing need
to reduce bandwidth consumption while simultaneously
generalizing the objective of SC methods to provide mobile
clients access to low-latency inference from remote off-the-
shelf discriminative models even in constrained networks.
We draw from recent advancements in lossy learned
image compression (LIC) and the Information Bottleneck
(IB) principle [12]. Despite outperforming handcrafted
codecs [13], such as PNG, or WebP [14], LIC is unsuitable
for real-time inference in MEC since they consist of large
models and other complex mechanisms that are demand-
ing even for server-grade hardware. Further, research in
compression primarily focuses on reconstruction for human
perception containing information superfluous for M2M
communication. In comparison, the deep variational infor-
mation bottleneck (DVIB) provides an objective for learned
feature compression with DNNs, prioritizing information
valuable for machine interpretability.
With DVIB, we can conceive generalizable methods
that are applicable to off-the-shelf architectures. However,
current DVIB approaches typically place the bottleneck at
the penultimate layer. Thus, they are unsuitable for most
common settings that assume an asymmetric resource allo-
cation between the client and the server. In other words, the
objectives of DVIB and MEC contradict each other, i.e., for
the latter, we require shifting the bottleneck’s location to the
shallow layers.
We accommodate the restrictions of mobile clients by in-
troducing a method that moves the bottleneck to the shallow
layers and retains generalizability to arbitrary architectures.
While shifting the bottleneck does not formally change
the objective, we will demonstrate that existing methods
for mutual information estimation lead to unsatisfactory
results.
To this end, we introduce FrankenSplit: A novel training
and design heuristic for variational feature compression
models embeddable in arbitrary DNN architectures with
pre-trained weights for high-level vision tasks. FrankenSplit
is refreshingly simple to implement and deploy without
additional decision mechanisms that rely on runtime com-
ponents for probing external conditions. Additionally, by
deploying a single lightweight encoder, the client can access
state-of-the-art accuracy from multiple large server-grade
models without reloading weights from memory for each
task. Lastly, the approach does not require modifying dis-
criminative models (e.g., by finetuning weights). Therefore,
we can directly utilize foundational off-the-shelf models and
seamlessly integrate FrankenSplit into existing systems.
We open-source our repository1 as an addition to the
community for researchers to reproduce and extend our
experiments. In summary, our contributions are:
• Thoroughly exploring how shallow and deep bottleneck injection differ for feature compression.
• Introducing a novel saliency-guided training method to overcome the challenges of training a lightweight encoder with limited capacity to compress features usable for several downstream tasks.
• Introducing a generalizable design heuristic for embedding a variational feature compression model into arbitrary DNN architectures.
Section 2 discusses relevant work on SC and LIC. Sec-
tion 3 discusses the limitations of SC methods and moti-
vates neural feature compression. Section 4 describes the
problem domain. Section 5 progressively introduces the
solution approach. Section 6 extensively justifies relevant
performance indicators and evaluates several implementa-
tions of FrankenSplit against various baselines to assess our
method’s efficacy. Lastly, Section 7 summarizes this work
and highlights limitations to motivate follow-up work.
1. https://github.com/rezafuru/FrankenSplit
2 RELATED WORK
2.1 Neural Data Compression
2.1.1 Learned Image Compression
The goal of (lossy) image compression is minimizing bitrates
while preserving information critical for human perception.
Transform coding is a basic framework of lossy compres-
sion, which divides the compression task into decorrelation
and quantization [15]. Decorrelation reduces the statistical
dependencies of the pixels, allowing for more effective
entropy coding, while quantization represents the values
as a finite set of integers. The core difference between
handcrafted and learned methods is that the former relies on
linear transformations based on expert knowledge. Contrar-
ily, the latter is data-driven with non-linear transformations
learned by neural networks [16].
Ballé et al. introduced the Factorized Prior (FP) entropy
model and formulated the neural compression problem by
finding a representation with minimal entropy [17]. An
encoder network transforms the original input to a latent
variable, capturing the input’s statistical dependencies. In
follow-up work, Ballé et al. [18] and Minnen et al. [19]
extend the FP entropy model by including a hyperprior
as side information for the prior. Minnen et al. [19] intro-
duce the joint hierarchical priors and autoregressive entropy
model (JHAP), which adds a context model to the existing
scale hyperprior latent variable models. Typically, context
models are lightweight, i.e., they add a negligible number
of parameters, but their sequential processing increases the
end-to-end latency by orders of magnitude.
2.1.2 Feature Compression
Singh et al. demonstrate a practical method for the Infor-
mation Bottleneck principle in a compression framework
by introducing the bottleneck in the penultimate layer and
replacing the distortion loss with the cross-entropy for im-
age classification [20]. Dubois et al. generalized the VIB for
multiple downstream tasks and were the first to describe
the feature compression task formally [21]. However, their
encoder-only CLIP compressor has over 87 million parame-
ters. Both Dubois and Singh et al. consider feature compres-
sion for mass storage, i.e., they assume the data is already
present at the target server. In contrast, we consider how
resource-constrained clients must first compress the high-
dimensional visual data before sending it over a network.
Closest to our work is the Entropic Student (ES) pro-
posed by Matsubara et al. [22], [23], as we follow the same
objective of real-time inference with feature compression.
Nevertheless, they simply apply the learning objective and
a scaled-down version of autoencoder from [17], [18]. Con-
trastingly, we carefully examine the problem domain of
resource-conscious feature compression to identify under-
lying issues with current methods, allowing us to conceive
novel solutions with significantly better rate-distortion per-
formance.
2.2 Split Computing
We distinguish between two orthogonal approaches to SC.
2.2.1 Split Runtimes
Split runtime systems are characterized by performing no
or minimal modifications on off-the-shelf DNNs. The ob-
jective is to dynamically determine split points according
to the available resources, network conditions, and intrinsic
model properties. Hence, split runtimes primarily focus on
profilers and adaptive schedulers. Kang et al. performed
extensive compute cost and feature size analysis on the
layer-level characterizations of DNNs and introduced the
first split runtime system [24]. Their study has shown that
split runtimes are only sensible for DNNs with an early
natural bottleneck, i.e., models performing aggressive di-
mensionality reduction within the shallow layers. However,
most modern DNNs increase feature dimensions until the
last layers for better representation. Consequently, follow-
up work focuses on feature tensor manipulation [25]–[27].
We argue against split runtimes since they introduce con-
siderable complexity. Worse, the system must be tuned
toward external conditions, with extensive profiling and
careful calibration. Additionally, runtimes add overhead
and introduce another point of failure by hosting a network-spanning
system. Notably, even the most sophisticated methods still
rely on a natural bottleneck, evidenced by how state-of-the-
art split runtimes still report results on superseded DNNs
with an early bottleneck [28], [29].
2.2.2 Artificial Bottleneck Injection
By shifting the effort towards modifying and re-training an
existing base model (backbone) to replace the shallow layers
with an artificial bottleneck, bottleneck injection retains the
simplicity of offloading. Eshratifar et al. replace the shallow
layers of ResNet-50 with a deterministic autoencoder net-
work [30]. A follow-up work by Jiawei Shao and Jun Zhang
further considers noisy communication channels [31]. Mat-
subara et al. [32], and Sbai et al. [33] propose a more general
network agnostic knowledge distillation (KD) method for
embedding autoencoders, where the output of the split
point from the unmodified backbone serves as a teacher.
Lastly, we consider the work in [22] as the state-of-the-art
for bottleneck injection.
Although bottleneck injection is promising, there are two
problems with current methods. They rely on deterministic
autoencoders for a crude approximation to compression or
are intended for a specific class of neural network architec-
ture.
This work addresses both limitations of such bottleneck
injection methods.
3 THE CASE FOR NEURAL DATA COMPRESSION
We assume an asymmetric resource allocation between
the client and the server, i.e., the latter has considerably
higher computational capacity. Additionally, we consider
large models required for state-of-the-art performance on non-trivial
discriminative tasks unsuitable for mobile clients. Progress
in energy-efficient ASICs and embedded AI with model
compression through quantization, channel pruning, etc., permits
constrained clients to execute lightweight DNNs. Never-
theless, they are bound to reduced predictive strength rela-
tive to their contemporary unconstrained counterparts [34].
This assumption is sensible considering the trend for DNNs
towards pre-trained foundational models with rising com-
putational requirements due to increasing model sizes [35]
and costly operations [36].
Lastly, mobile devices cannot realistically load weights
for multiple models simultaneously [9], and it is unreason-
able to expect that a single compressed model is sufficient
for applications with complex requirements that rely on
various models concurrently or in quick succession.
Conclusively, despite the wide availability of onboard
accelerators, the demand for large models to solve intel-
ligent tasks will further increase, transmitting large vol-
umes of high-dimensional data. The claim is consistent with
CISCO’s report that emphasizes the accelerating bandwidth
consumption by M2M communication [7].
3.1 Limitations of Split Computing
Still, it would be valuable to leverage advancements in
energy-efficient mobile chips beyond applications where
local inference is sufficient. In particular, SC can poten-
tially draw resources from an entire edge-cloud compute
continuum while binary on- or offloading decision mech-
anisms will leave valuable client or server-side resources
idle. Figure 1 illustrates generic on/offloading and split
runtimes.
Fig. 1: Prediction with on/offloading and split runtimes
The caveat is that both SC approaches discussed in Section 2.2 are only conditionally applicable. In particular,
split runtimes reduce server-side computation for inference
tasks with off-the-shelf models by onloading and executing
shallow layers at the client. This approach introduces two
major limitations.
First, when latency is crucial, splitting is only sensible if the time for client-side execution, transferring the
features, and remotely executing the remaining layers is
less than the time of directly offloading the task. More
recent work [27]–[29] relies on carefully calibrated dynamic
decision mechanisms. A runtime component periodically
measures external conditions (e.g., network bandwidth) and internal conditions
(e.g., client load) to determine ideal split points or whether
direct offloading is preferable.
Second, since the shallow layers must match the deeper
layers, split runtimes cannot accommodate applications
with complex requirements, which is a common justification
for MEC (e.g., MAR). Constrained clients would need to
swap weights from storage into memory each time the
prediction model changes. Worse, the layers must match
even for models predicting the same classes with closely
related architectures.
Hence, it is particularly challenging to integrate split
runtimes into systems that can increase the resource effi-
ciency of servers by adapting to shifting and fluctuating
environments [37], [38]. For example, when a client specifies
a target accuracy and a tolerable lower bound, the system
could select a ResNet-101 that can hit the target accuracy
but may temporarily fall back to a ResNet-50 to ease the
load when necessary.
3.2 Execution Times with Resource Asymmetry
Table 1 summarizes the results of a simple experiment to
demonstrate limitations incurred by resource asymmetry.
The client is an Nvidia Jetson NX2 equipped with an AI
accelerator, and the server hosts an RTX 3090 (see Section 6
for details on hardware configurations). We measure the
execution times of ResNet variants, classifying a single
3×224×224 tensor at two split points.
TABLE 1: Execution Times of Split Models
Model      | Split Index | Head [NX2] (ms) | Head [3090] (ms) | Tail [3090] (ms) | Rel. Exec. [NX2] (%) | Contribution [NX2] (%)
ResNet-50  | Stem        | 1.5055          | 0.1024           | 4.9687           | 23.25                | 0.037
ResNet-50  | Stage 1     | 8.2628          | 0.9074           | 4.0224           | 67.26                | 0.882
ResNet-101 | Stem        | 1.5055          | 0.1024           | 9.8735           | 13.23                | 0.021
ResNet-101 | Stage 1     | 8.2628          | 0.9074           | 8.9846           | 47.91                | 0.506
ResNet-152 | Stem        | 1.5055          | 0.1024           | 14.8862          | 9.18                 | 0.015
ResNet-152 | Stage 1     | 8.2628          | 0.9074           | 13.8687          | 37.34                | 0.374
Similar to other widespread architectural families, ResNets organize their layers into four top-level layers, and the coarse-grained ones recursively consist of finer-grained ones. While the terminology differs between architectures, we will uniformly refer to top-level layers as stages and the coarse-grained layers as blocks.
Split point stem assigns the first preliminary block as
the head model. It consists of a convolutional layer with
batch normalization [39] and ReLU activation, followed
by max pooling. Split point Stage 1 additionally assigns
the first stage to the head. Notice how the shallow layers
barely constitute the overall computation, even when the
client takes more time to execute the head than the server
for the entire model. Further, compare the percentages of
total computation time and relate them to the number of
parameters. At best, the client contributes 0.02% to the
model execution when taking 9% of the total computation
time and may only contribute 0.9% when taking 67% of the
computation time.
Despite a powerful AI accelerator, it is evident that
utilizing client-side resources to aid a server is inefficient.
Consequently, SC methods commonly include some form
of quantization and data size reduction to offset resource
asymmetry. In the following, we conceive a hypothetical
SC method to provide intuition behind the importance of
reducing transfer costs.
3.3 Feature Tensor Dimensionality and Quantization
Typically, most work starts with some statistical analysis
of the output layer dimensions, as illustrated in Figure 2.
Excluding repeating blocks, the feature dimensionality is
identical for ResNet-50, -101, and -152. The red line marks
the cutoff where the size of the intermediate feature tensor
is less than the original input. ResNets (including more
modern variants [40]), among numerous recent architec-
tures [35], [36], do not have an early natural bottleneck
and will only drop below the cutoff from the first block of
the second stage (S3RB1-2). Since executing until S3RB1-2
contributes only about 0.06%, modern methods reduce the number of
layers a client must execute with feature tensor quantization
and other clever (typically statistical) methods that statically
or dynamically prune channels [11].
Fig. 2: Output dimensionality distribution for ResNet
For our hypothetical
method, we use the execution times from Table 1. We
generously assume that the method applies feature tensor
quantization and channel pruning to reduce the expected
data size without a loss in accuracy for the ImageNet classi-
fication task [41] and with no computational costs. Further,
we reward the client for executing deeper layers to reflect
deterministic bottleneck injection methods, such that the
output size of the stem and stage one are 802816 and 428168
bits, respectively. Note that, for stage one, this is roughly
a 92% reduction relative to its original FP32 output size.
Yet, the plots in Figure 3 show that offloading with PNG,
let alone more modern lossless codecs (e.g., WebP), will
beat SC in total request time, except when the data rate is
severely constrained.
Fig. 3: Inference latency for SC and offloading (total latency over data rate; SC with dimensionality reduction vs. offloading with PNG)
Evidently, using reasonably powerful
AI accelerators to execute the shallow layers of a target
model is not an efficient use of client-side resources.
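To make the comparison in Figure 3 concrete, the following sketch reproduces the simple latency model behind the plots for ResNet-50 split at the stem. The head and tail timings are taken from Table 1; the PNG payload size, codec time, and data rates are illustrative assumptions rather than measurements.

```python
# Latency model behind Figure 3, sketched for ResNet-50 split at the stem.
# Head/tail timings come from Table 1; the PNG size and codec time are
# illustrative assumptions, not measurements.

def sc_latency_ms(rate_mbps: float) -> float:
    head_client_ms = 1.5055           # stem on the Jetson NX2 (Table 1)
    tail_server_ms = 4.9687           # remaining layers on the RTX 3090 (Table 1)
    feature_bits = 802_816            # assumed quantized/pruned stem output size
    transfer_ms = feature_bits / (rate_mbps * 1e6) * 1e3
    return head_client_ms + transfer_ms + tail_server_ms

def offload_latency_ms(rate_mbps: float) -> float:
    codec_ms = 2.0                    # hypothetical client-side PNG encoding time
    image_bits = 8 * 60_000           # hypothetical PNG size of a 224x224 image (~60 kB)
    full_model_ms = 0.1024 + 4.9687   # complete ResNet-50 on the RTX 3090 (Table 1)
    return codec_ms + image_bits / (rate_mbps * 1e6) * 1e3 + full_model_ms

for mbps in (10, 30, 70):
    print(f"{mbps} Mbps: SC {sc_latency_ms(mbps):.1f} ms, offload {offload_latency_ms(mbps):.1f} ms")
```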
3.4 The Advantage of Learned Methods
In a narrow sense, more modern work on SC considers minimizing
the transmitted data with feature tensor quantization
and other clever (typically statistical) methods that statically
or dynamically prune channels.
While dimensionality reduction can be seen as a crude
approximation to compression, it is not equivalent to it [19].
Compression aims to reduce the entropy of the latent under
a prior shared between the sender and the receiver [16].
Dimensionality reduction (especially channel pruning) may
seem effective for simple tasks (e.g., CIFAR-10 [42]). How-
ever, this is more due to the overparameterization of large
DNNs. Precisely, for a simple task, we can prune most
channels or inject a small autoencoder at the shallow layers
that may appear to achieve unprecedented compression
rates relative to the unmodified head’s feature tensor size.
In Section 6.3.9, we will show that methods working reason-
ably well on a simple dataset can ultimately falter on more
challenging datasets.
From an information-theoretic point of view [43], tensor
dimensionality (i.e., C×H×W×Precision) is not an adequate
measure to determine transfer costs. Instead, we should
consider the entropy of the feature tensor [43]. Then, we can
optimize a model to reduce uncertainty and compress an
input according to its information content.
To summarize, the potential of SC is inhibited by primar-
ily focusing on shifting parts of the model execution from
the server to the client. SC’s viability is not determined by
how well they can partially compute a split network but
by how well they can reduce the input size. Therefore, we
pose the following question: Is it more efficient to focus the
local resources exclusively on compressing the data rather than
executing shallow layers of a network that would constitute a
negligible amount of the total computation cost on the server?
Fig. 4: Prediction with Variational Bottleneck Injection
In Figure 4, we sketch predictions with our proposed
approach. There are two key distinctions from common
SC methods.
First, the model is not split between the client and
the server. Instead, it deploys a lightweight encoder, and
a decoder replaces the shallow layers of a backbone, i.e.,
the backbone is split within the server. A single decoder
architecture corresponds to backbones with related archi-
tectures. Notably, a decoder restores and transforms the
compressed signals to a backbone that may accommodate
multiple tasks. The encoder is decoupled from a particular
task and the decoder-backbone pair. Section 5.3 elaborates
how separating the concerns permits one encoder instance
to accommodate multiple decoder-backbone pairs.
Second, compared to split runtimes, the decision to
apply the compression model may only depend on internal
conditions. It can decouple the client from any external com-
ponent (e.g., server, router). Ideally, applying the encoder
should always be preferable if a mobile device has the
minimal required resources. Since our method does not alter
the backbones, we do not need to maintain additional mod-
els to accommodate clients who cannot apply the encoder.
Instead, we can simply route the image tensor to the input
layer of the (unmodified) model.
The following describes the limitations of existing work
for constrained devices in order to conceive a method matching
the above description.
4 PROBLEM FORMULATION
The goal is for constrained clients to request real-time
predictions from a large DNN while maximizing resource
efficiency and minimizing bandwidth consumption with
compression methods. Figure 5 illustrates the possible ap-
proaches when dedicating client resources exclusively for
compression.
Fig. 5: Utilizing client resources with (learned) codecs
Strategy a) corresponds to offloading strate-
gies with CPU-bound handcrafted codecs. Strategy b) rep-
resents recent LIC models. Learned methods can achieve
considerably lower bitrates at comparable distortion than
commonly used handcrafted codecs [16]. Nevertheless, we
must consider that the overhead of executing large DNNs
may dominate the reduced transfer time. Strategy c) is our
advocated method with an embeddable variational feature
compression that draws from the same underlying Nonlin-
ear Transform Coding (NTC) framework as b). The chal-
lenge is to reduce overhead to make variational compression
models suitable for real-time prediction with limited client
resources.
To overcome the limitations of existing methods, we
require (i) a resource-conscious encoder design. The encoder
should minimize the transfer cost without increasing the
predictive loss. Additionally, (ii) the decoder should exploit
the available server-side resources without incurring sig-
nificant overhead. Lastly, (iii) a compression model should
fit for different downstream tasks and architectural families
(e.g., CNNs or Vision Transformers).
Before we can conceive an adequate method, we must
formalize the properties of a suitable objective and elaborate
on the limitations of existing methods when applied to
shallow bottlenecks.
4.1 Rate-Distortion Theory for Model Prediction
By Shannon's rate-distortion (r-d) theory [44], we seek a
mapping bound by a distortion constraint from a random
variable (r.v.) $X$ to an r.v. $U$, minimizing the bitrate of the
outcomes of $X$. More formally, given a distortion measure
$D$ and a distortion constraint $D_c$, the minimal bitrate is:

$$\min_{P_{U|X}} I(X;U) \quad \text{s.t.} \quad D(X,U) \leq D_c \tag{1}$$

where $I(X;U)$ is the mutual information, defined as

$$I(X;U) = \iint p(x,u)\, \log \frac{p(x,u)}{p(x)\,p(u)}\, dx\, du \tag{2}$$
In lossy image compression, $U$ is typically the reconstruction $\tilde{X}$ of the original input, and the distortion measure is some sum of squared errors $d(x, \tilde{x})$. Since the r-d theory does not restrict us to image reconstruction [45], we can apply distortion measures relevant to M2M communication. Notably, when our objective is to minimize predictive loss rather than to reconstruct the input, we do not need to keep information that may be excessive for model predictions.
To elaborate on the potential of discarding information for discriminative tasks, consider the Data Processing Inequality (DPI). For any three r.v.s $X, Y, Z$ that form a Markov chain $X \rightarrow Y \rightarrow Z$, the following holds:

$$I(X;Y) \geq I(X;Z) \tag{3}$$

Then, we can describe the information flow in an $n$-layered sequential DNN with the information path by viewing layered neural networks as a Markov chain of successive representations [46]:

$$I(X;Y) \geq I(R_1;Y) \geq I(R_2;Y) \geq \ldots \geq I(R_n;Y) \geq I(\tilde{Y};Y) \tag{4}$$

In other words, the final representation before a prediction $R_n$ cannot have more mutual information with the target than the input $X$ and typically has less. In particular, for high-level vision tasks that map a high-dimensional input vector with strong pixel dependencies to a small set of labels, we can expect $I(X;Y) \gg I(\tilde{R}_n; Y)$.
4.2 From Deep to Shallow Bottlenecks
When the task is to predict the ground-truth labels $Y$ from a joint distribution $P_{X,Y}$, the r-d objective is essentially given by the information bottleneck principle [12]. By relaxing (1) with a Lagrangian multiplier, the objective is to maximize:

$$I(Z;Y) - \beta I(Z;X) \tag{5}$$
Specifically, an encoding $Z$ should be a minimal sufficient statistic of $X$ respective $Y$, i.e., we want $Z$ to contain relevant information regarding $Y$ while discarding irrelevant information from $X$. Practical implementations differ by the target task and how they approximate (5). For example, an approximation of $I(Z;Y)$ for an arbitrary classification task is the conditional cross-entropy (CE) [13]:

$$D = H(P_Y, P_{\tilde{Y}|Z}) \tag{6}$$
Using (6) for estimating $I(Z;Y)$ to end-to-end optimize a neural compression model is not a novel idea (Section 2.1.2). However, a common assumption in such work is that the latent variable is the final representation $R_n$ of a large backbone, which we refer to as Deep Variational Information Bottleneck Injection (DVBI). Conversely, we work with resource-constrained clients, i.e., to conceive lightweight encoders, we must shift the bottleneck to the shallow layers, which we refer to as Shallow Variational Bottleneck Injection (SVBI). Intuitively, the existing methods for DVBI should generalize to SVBI, e.g., estimating the distortion term with (6) as in [20].
While shifting the bottleneck to the shallow layers results in an encoder with less capacity, the objective still approximates (1). Yet, as we will show in Section 6.3.9, applying the objective from [20] yields incomparably worse results when moving the bottleneck to the shallow layers.
A more promising method to estimate $I(Z;Y)$ is Head Distillation (HD) [32], [33] since it naturally aligns with shallow bottlenecks. As we will show in Section 6.3, HD yields significantly better results than applying (6). Surprisingly, despite showing promising results, HD is a suboptimal estimation for $I(Z;X)$ to approximate eq. (1).
The following elaborates on SVBI and formulates the VIB
objective for HD.
4.3 Head Distilled Deep Variational IB
Ideally, the bottleneck is embeddable in an existing predictor $P_T$ without decreasing the performance. Therefore, it is not the hard labels $Y$ that define the task but the soft labels $Y_T$. For simplicity, we handle the case for one task and defer discussion on multiple downstream tasks and DNNs to Section 5.3.
To perform SVBI, take a copy of $P_T$. Then, mark the location of the bottleneck by separating the copy into a head $P_h$ and a tail $P_t$. Importantly, both parts are deterministic, i.e., for every realization of r.v. $X$ there is a representation $P_h(x) = h$ such that $P_T(x) = P_t(P_h(x))$. Lastly, replace the head with an autoencoder and a parametric entropy model. The encoder is deployed at the sender, the decoder at the receiver, and the entropy model is shared. We distinguish between two optimization strategies to train the bottleneck's compression model. The first is direct optimization, corresponding to the DVIB objective in (5), except we replace the CE with the standard KD loss [47] to estimate $I(Z;Y)$. The second is indirect optimization and describes HD with the objective:
$$I(Z;H) - \beta I(Z;X) \tag{7}$$
Unlike the former, the latter does not directly correspond to (1) for a representation $Z$ that is a minimal sufficient statistic of $X$ respective $Y_T$. Instead, it replaces $Y$ with a proxy task for the compression model to replicate the output of the replaced head, i.e., training methods approximating (7) optimize for a $Z$ that is a minimal sufficient statistic of $X$ respective $H$. Figure 6 illustrates the difference between estimating the objectives (5) and (7).
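As a minimal illustration of the two strategies, the sketch below contrasts their distortion targets: direct optimization distills the teacher's soft predictions (estimating I(Z;Y)), whereas HD regresses the output of the replaced head (estimating I(Z;H)). The toy modules, shapes, and KD temperature are placeholders, not the paper's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the frozen teacher head/tail and the injected bottleneck;
# shapes and module choices are illustrative only.
teacher_head = nn.Conv2d(3, 64, 3, stride=4, padding=1)
teacher_tail = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
encoder = nn.Conv2d(3, 16, 3, stride=4, padding=1)
decoder = nn.Conv2d(16, 64, 3, padding=1)
T = 4.0                                   # KD temperature (assumption)
x = torch.randn(2, 3, 224, 224)

with torch.no_grad():
    h = teacher_head(x)                   # H: target of indirect optimization (HD)
    teacher_logits = teacher_tail(h)      # soft labels Y_T for direct optimization

h_hat = decoder(encoder(x))               # bottleneck output replacing the head

# Direct optimization: standard KD loss on the teacher's soft labels (estimates I(Z;Y)).
direct_distortion = F.kl_div(
    F.log_softmax(teacher_tail(h_hat) / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T

# Indirect optimization (HD): regress the replaced head's output (estimates I(Z;H)).
hd_distortion = F.mse_loss(h_hat, h)
```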
With faithful replication of H, the partially modified
DNN has an information path equivalent to its unmodified
version. A sufficient statistic retains the information neces-
sary to replicate the input for a deterministic tail, i.e., the
final prediction does not change. The problem of (7) is that it
is a suboptimal approximation of (1). Although sufficiency
Fig. 6: Left: Head Distillation. Right: Direct Optimization
still holds, it optimizes $Z$ respective $H$ and not $Y_T$ to be minimal.
Based on the above formulations, the following proposes
a practical method to train stochastic feature compression
models. Additionally, it addresses the limitations of HD and
includes architectural considerations.
5 SOLUTION APPROACH
Our solution focuses on two distinct but intertwined as-
pects. First is an appropriate training objective. The second
concerns a practical implementation by introducing an ar-
chitectural design heuristic to accommodate backbones with
various architectures with a single encoder architecture.
5.1 Loss Function for End-to-end Optimization
We follow NTC [16] to implement a neural compression
algorithm. Specifically, we embed a stochastic compression
model that we jointly optimize with an entropy model.
Our objective resembles variational image compression
optimization, as introduced in [17], [18]. For an image vector
$x$, we have a parametric analysis transform $g_a(x;\phi_g)$ that maps $x$ to a latent vector $z$. Then, a quantizer $Q$ discretizes $z$ to $\bar{z}$, such that an entropy coder can use the entropy model to losslessly compress $\bar{z}$ to a sequence of bits. In learned image compression, a parametric synthesis transform $g_s(\bar{z};\theta_g)$ maps $\bar{z}$ to a reconstruction $\tilde{x}$ of the input.
However, we favor HD over direct optimization as a distortion measure since the former yields considerably better results even with a suboptimal loss function (Section 6.3.9). Therefore, we require a $g_s(\bar{z};\theta_g)$ that maps $\bar{z}$ to an approximation of a representation $\tilde{h}$ (i.e., the output of the shallow layers of an arbitrary backbone).
Analogous to variational inference, we approximate the intractable posterior $p_{\tilde{z}|x}$ with a parametric variational density $q_{\tilde{z}|x}$ as follows (excluding constants):

$$\mathbb{E}_{x \sim p_x}\, D_{KL}\!\left[\, q \,\|\, p_{\tilde{z}|x} \right] = \mathbb{E}_{x \sim p_x}\, \mathbb{E}_{\tilde{z} \sim q} \Big[ \underbrace{-\log p(x \mid \tilde{z})}_{\text{distortion}} \;\; \underbrace{-\log p(\tilde{z})}_{\text{weighted rate}} \Big] \tag{8}$$
By assuming a Gaussian distribution such that the likelihood of the distortion term is given by

$$p_{x|\tilde{z}}(x \mid \tilde{z}, \theta_g) = \mathcal{N}\!\left(x \mid g_s(\tilde{z};\theta_g), \mathbf{1}\right) \tag{9}$$

we can use the sum of squared differences between $h$ and $\tilde{h}$ as our distortion loss.
The rate term describes the cost of compressing $\tilde{z}$. Analogous to the LIC methods discussed in Section 2.1, we apply uniform quantization $\bar{z} = \lfloor \tilde{z} \rceil$. Since discretization leads to problems with the gradient flow, we apply a continuous relaxation by adding uniform noise $\eta \sim \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$. Combining the rate and distortion term, we derive the loss function for estimating objective (7) as:

$$\mathcal{L} = \left\| P_h(x) - g_s\!\left(g_a(x;\phi_g) + \eta;\, \theta_g\right) \right\|_2^2 \;-\; \beta \log p_{\tilde{z}}\!\left(g_a(x;\phi_g) + \eta\right) \tag{10}$$
As described in Section 4.3, by using HD for the distortion term, we rely on $H$ as a proxy target, i.e., the loss in Equation (10) is a suboptimal approximation of (1).
The suboptimality stems from treating every pixel in $H$ as equally important. The implication here is that the MSE in (10) overly strictly penalizes pixels at spatial locations that contain redundant information that later layers can safely discard. Contrarily, the loss may not penalize the salient pixels enough when $\tilde{h}$ is numerically close to $h$.
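A minimal PyTorch sketch of the loss in (10) is given below, using CompressAI's EntropyBottleneck for the rate term; during training the entropy model applies the uniform-noise relaxation internally. The toy encoder/decoder, the stand-in teacher head, and the value of β are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from compressai.entropy_models import EntropyBottleneck

# Toy stand-ins for g_a, g_s and the replaced shallow layers P_h.
latent_ch = 48
encoder = nn.Sequential(nn.Conv2d(3, latent_ch, 3, stride=4, padding=1), nn.ReLU())  # g_a
decoder = nn.Conv2d(latent_ch, 64, 3, padding=1)                                     # g_s
teacher_head = nn.Conv2d(3, 64, 3, stride=4, padding=1)                              # P_h (frozen)
entropy_bottleneck = EntropyBottleneck(latent_ch)
beta = 0.08                                      # rate weight (assumption)

x = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    h = teacher_head(x)                          # proxy target H

z = encoder(x)                                   # g_a(x)
z_hat, z_likelihoods = entropy_bottleneck(z)     # adds uniform noise during training
h_hat = decoder(z_hat)                           # approximates h

distortion = torch.sum((h - h_hat) ** 2) / x.size(0)       # squared error against H
rate_bits = -torch.log2(z_likelihoods).sum() / x.size(0)   # -log p(z~) in bits per sample
loss = distortion + beta * rate_bits                       # Equation (10)
```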
Hence, we can improve the loss in (10) by introduc-
ing additional signals that regularize the suboptimal dis-
tortion term. The challenge is finding a tractable method
that emphasizes the salient pixels necessary for multiple
instances of a high-level vision task (e.g., classification of
various datasets and labels). Moreover, the method should
exclusively concern the loss function, i.e., it should not
introduce any additional model components or operations
during inference.
5.2 Saliency Guided Distortion
We consider HD an extreme form of Hint Training (HT) [48],
[49] where the hint becomes the primary objective rather
than an auxiliary regularization term. Sbai et al. perform
deterministic bottleneck injection with HD using the suboptimal distortion term [33]. Nevertheless, their method only considers dimensionality reduction without a parametric entropy model as an approximation to compression, i.e., it is generalized by the loss in (10) with $\beta = 0$. Matsubara et al. add further hints from the deeper layers by extending the distortion term with the sum of squared differences between the outputs of the deeper layers [23], [32]. This approach has several downsides besides prolonged training time. The distortion term may now dominate the rate term, i.e., without exhaustively tuning the hyperparameters for each distortion term, the optimization algorithm will tend to converge towards local optima. Moreover, we show in Section 6.3.2 that pure HD can significantly outperform this method using the loss in Equation (10) without the hints from the deeper layers.
In principle, we could improve the performance by ex-
tracting signals from deeper layers and directly transferring
them to the bottleneck. The caveat is that the effectiveness
of knowledge distillation decreases for teachers when the
student has considerably less capacity than the teacher [48].
Hence, instead of directly introducing hints at the encoder,
we propose regularizing the distortion term with saliency
maps.
For each sample, we require a vector $S$, where each $s_i \in S$ is a weight term for a spatial location's saliency with respect to the conditional probability distributions of the remaining tail layers. Then, we should be able to improve the r-d performance by regularizing the distortion term in (10) with

$$\mathcal{L}_{\text{distortion}} = \gamma_1 \cdot \mathcal{L}_1 + \gamma_2 \cdot \frac{1}{N} \sum_i s_i \cdot (h_i - \tilde{h}_i)^2 \tag{11}$$

where $\mathcal{L}_1$ is the distortion term from Equation (10), and $\gamma_1, \gamma_2$ are nonnegative real numbers summing to 1. We default to $\gamma_1 = \gamma_2 = \tfrac{1}{2}$ in our experiments. Figure 7
describes our final training setup.
Fig. 7: Training setup
Note that we only require
computing the saliency maps once, and they are architec-
turally agnostic towards the encoder.
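A sketch of the regularized distortion term in (11) is shown below. It assumes the precomputed saliency maps are stored per sample, normalized to [0, 1], and resized to the spatial dimensions of H; γ1 = γ2 = 1/2 as stated above.

```python
import torch
import torch.nn.functional as F

def saliency_guided_distortion(h: torch.Tensor,
                               h_hat: torch.Tensor,
                               saliency: torch.Tensor,
                               gamma1: float = 0.5,
                               gamma2: float = 0.5) -> torch.Tensor:
    """Distortion term from Equation (11).

    h, h_hat : (N, C, H, W) teacher-head output and its reconstruction.
    saliency : (N, 1, H', W') precomputed Grad-CAM maps in [0, 1]; shapes and
               normalization are assumptions for this sketch.
    """
    s = F.interpolate(saliency, size=h.shape[-2:], mode="bilinear", align_corners=False)
    sq_err = (h - h_hat) ** 2
    l1 = sq_err.mean()          # plain MSE term from Equation (10)
    l2 = (s * sq_err).mean()    # saliency-weighted squared error
    return gamma1 * l1 + gamma2 * l2
```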
We derive the saliency maps using class activation map-
ping (CAM) [50]. Although CAMs are typically used to
improve the explainability of DNNs, they suit our purposes
by allowing us to summarize salient pixel locations. Specif-
ically, we use a variant of Grad-CAM [51] to measure a
spatial location’s importance at any stage. Figure 8 illus-
trates some examples of saliency maps when averaged over
the deeper backbone stages.
Fig. 8: Extracted saliency maps using Grad-CAM
In this work, we favor Grad-CAM over (more intricate) methods due to its architecture-
agnostic nature and computational efficiency. For example,
mixing with guided backpropagation [52] could refine the
resulting saliency maps with finer-grained feature impor-
tance scaling. However, guided backpropagation relies on
specific properties of the activation function and requires
adjustments for each architectural family.
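The sketch below shows one way to precompute such maps offline with the XGradCAM implementation from the library in [61]; the choice of target layers and the averaging over deeper stages follow the description above, while the exact API call is an assumption that may differ between library versions.

```python
import timm
import torch
from pytorch_grad_cam import XGradCAM  # library from [61]; import path may vary by version

# Precompute a saliency map once per training sample from the deeper stages of
# the teacher backbone. Layer selection is an assumption based on the text.
model = timm.create_model("resnet50", pretrained=True).eval()
target_layers = [model.layer3, model.layer4]          # deeper backbone stages

cam = XGradCAM(model=model, target_layers=target_layers)
x = torch.randn(1, 3, 224, 224)                       # stand-in for a training image
saliency = cam(input_tensor=x, targets=None)          # (1, 224, 224), averaged over layers
```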
5.3 Network Architecture
The beginning of this section broke down our aim into three
problems. We addressed the first with SVBI and proposed
a novel training method for low-capacity compression
models. A generalizable resource-asymmetry-aware autoen-
coder design remains. Additionally, the encoder should be
reusable for several backbones. To not inflate the signifi-
cance of our contribution, we refrain from including com-
ponents based on existing work in efficient neural network
design.
5.3.1 Model Taxonomy
We introduce a minimal taxonomy described in Figure 9
for our approach. The top level, Archetype, reflects the pri-
mary inductive bias of the model. Architectural families de-
scribe variants (e.g., ResNets such as ResNet [53], Wide
ResNet [54], ResNeXt [40], etc.). Directly related refers to the
same architecture of different sizes (e.g., Swin-T, Swin-S,
Swin-B, etc.).
Fig. 9: Simple taxonomy with minimal example
The challenge is to conceive a design heuristic
that can exploit the available server resources to aid the
lightweight encoder with minimal overhead on the predic-
tion task. First, we concretize shallow features by describing
how to locate the layers for bottleneck placement. Then, we
derive the heuristic to conceive decoder models for arbitrary
architectural families and how to account for client-server
resource asymmetry.
Lastly, we describe how to share trained compressor
components among directly related architectures.
5.3.2 Bottleneck Location by Stage Depth
Consider how most modern DNNs consist of an initial
embedding followed by a few stages (Described in Sec-
tion 3.1). Within directly related architectures, the individual
components are identical. The difference between variants
is primarily the embed dimensions or the block ratio of
the deepest stage. For example, the block ratio of ResNet-
50 is 3:4:6:3, while the block ratio of ResNet-101 is 3:4:23:3.
Consequently, the stage-wise organization of models defines
a natural interface for SVBI. For the remainder of this work,
we refer to the shallow layers as the layers before the deepest
stage (i.e., the initial embedding and the first two stages).
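For concreteness, the sketch below locates the shallow layers of a timm ResNet-50 by its stage attributes; attribute names follow timm's ResNet implementation, and other families expose their stages under different names, so this is an illustration rather than a general recipe.

```python
import timm
import torch.nn as nn

# Locate the "shallow layers" of a backbone by stage depth (ResNet-50, ratios 3:4:6:3).
backbone = timm.create_model("resnet50", pretrained=True)

# Everything before the deepest stage: initial embedding (stem) plus the first two stages.
shallow_layers = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.act1, backbone.maxpool,
    backbone.layer1, backbone.layer2,
)
# The deep layers that remain untouched on the server.
deep_layers = nn.Sequential(
    backbone.layer3, backbone.layer4, backbone.global_pool, backbone.fc,
)
```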
5.3.3 Decoder Blueprints
A key characteristic distinguishing archetypes is the induc-
tive bias introduced by basic building blocks (e.g., convo-
lutions versus attention layers). To consider the varying
representations among non-related architectures, we should
not disregard architecture-induced bias by directly repur-
posing neural compression models for SC. For example,
This article has been accepted for publication in IEEE Transactions on Mobile Computing. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TMC.2024.3381952
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
9
a scaled-down version of Ballé et al.'s [17] convolutional
neural compression model can yield strong r-d performance
for bottlenecks reconstructing a convolutional layer [22].
However, we will show that this does not generalize to other
architectural families, such as hierarchical vision transform-
ers [55].
One potential solution is to use identical components for
the compression model from a target network. While this
may be inconsequential for server-side decoders, it is inade-
quate for encoders due to the heterogeneity of edge devices.
Vendors have varying support for the basic building blocks
of a DNN, and particular operations may be prohibitively
expensive for the client. Hence, in FrankenSplit, the encoder
is fixed, but the decoder is adaptable. Regardless of the
decoder architecture, we account for the heterogeneity with
a uniform encoder architecture composed of three down-
sampling residual blocks of two stacked 3×3 convolutions
with ReLU non-linearity, totaling around 140,000 parame-
ters. We handle the varying representations by introducing
decoder blueprints tailored towards an architectural family,
i.e., one blueprint corresponds to all directly related archi-
tectures.
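A sketch of such a uniform encoder is given below: three downsampling residual blocks, each with two stacked 3×3 convolutions and ReLU non-linearities. The channel widths are assumptions chosen to land near the stated parameter budget.

```python
import torch.nn as nn

class DownBlock(nn.Module):
    """Residual block of two stacked 3x3 convolutions with ReLU, downsampling by 2."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2)

    def forward(self, x):
        return self.body(x) + self.skip(x)

# Uniform encoder: three downsampling residual blocks. Channel widths are
# illustrative assumptions chosen to land near ~140,000 parameters.
encoder = nn.Sequential(DownBlock(3, 48), DownBlock(48, 64), DownBlock(64, 48))
print(sum(p.numel() for p in encoder.parameters()))  # roughly 1.4e5
```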
Fig. 10: Reference implementation of FrankenSplit
Figure 10 illustrates a reference implementation of
FrankenSplit post-training with two blueprints applied to
two variants. Creating blueprints is required only once for
an architectural family. Boxes within the gray areas are
separate instances (i.e., only one encoder), and boxes with
the same name share an architecture. The rounded boxes
outside organize layer views from coarse to fine-grained. We
elaborate on how a single encoder can accommodate multi-
ple decoder-backbone pairs in Section 5.3.4. The numbers in
the parentheses refer to stage depth. Since the backbones are
foundational models extensively trained on large datasets,
we can naturally accommodate several downstream tasks
by attaching separately trained predictors.
Blueprint instances replace a backbone’s first two stages
(i.e., the shallow layers) with two blueprint stages, taking a
compressed representation as input instead of the original
sample. The work by Liang et al. [56] inspires our approach
to treat decoding as a restoration problem. Each stage com-
prises a restoration block and several blueprint (transforma-
tion) blocks, followed by a residual connection. The idea
is to separate restoration (i.e., upsampling, "smoothing"
quantized features) from transformation (i.e., matching the
target representation regardless of encoder architecture).
The restoration block is agnostic regarding the target ar-
chitecture and optionally upsamples. The blueprint blocks
induce the same bias as the target architectural family.
Two distinctions exist between the original blocks and
their corresponding blueprint (transformation). First, the
latter modifies operations not to reduce the latent spatial
dimensions. Second, the embedding layer dimensions and
stage depths may differ to reflect the resource asymmetry
commonly found in MEC.
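The sketch below outlines one blueprint stage under these assumptions: an architecture-agnostic restoration block (optional upsampling plus a smoothing convolution) followed by family-specific transformation blocks and a residual connection. The block factory, depths, and widths are placeholders; a concrete blueprint would plug in, e.g., ConvNeXt or Swin blocks.

```python
import torch.nn as nn

class BlueprintStage(nn.Module):
    """One decoder stage: architecture-agnostic restoration followed by
    family-specific transformation blocks with a residual connection.
    Block choices, depths, and widths are illustrative assumptions."""
    def __init__(self, in_ch: int, out_ch: int, transform_block, depth: int = 2,
                 upsample: bool = True):
        super().__init__()
        layers = [nn.Upsample(scale_factor=2, mode="nearest")] if upsample else []
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        self.restore = nn.Sequential(*layers)            # restoration: upsample + smooth
        # Transformation blocks induce the target family's inductive bias
        # (e.g., ConvNeXt blocks for ConvNeXt backbones, Swin blocks for Swin).
        self.transform = nn.Sequential(*[transform_block(out_ch) for _ in range(depth)])

    def forward(self, z):
        r = self.restore(z)
        return r + self.transform(r)
```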
Although we should consider the resource asymmetry
between the client and the server (i.e., by allocating more
parameters to the decoder), there are limitations. Learning
a function that can accurately retain necessary information
is limited by the encoder’s capacity (Section 4.1). Still, when
end-to-end optimizing the compression model, it can benefit
from increasing the decoder’s capacity for restoration with
diminishing returns.
Intuitively, we implement blueprints that result in de-
coder instances with, at most, the same execution time
as the head of a target backbone. As a reminder, unlike
most work in SC, we advocate keeping the execution time
roughly equal on the server rather than reducing it. The
encoder’s responsibility is not to minimize the server load
by executing shallow backbone layers. FrankenSplit treats
the encoder entirely separate from the backbone. Besides
dedicating the encoder exclusively to reducing transfer size,
this separation of concern is necessary to accommodate
several backbones with a single encoder instance.
5.3.4 Encoder Re-Usability
We argue that the representation of shallow layers general-
izes well enough that it is possible to reuse compressor com-
ponents. Consider the experiment illustrated in Figure 11,
where we split several backbones into head and tail models.
Fig. 11: Routing head outputs to different tails
The backbones are off-the-shelf models from torch image
models (timm) [57] and pre-trained on the ImageNet [41]
dataset. The head models consist of the initial embedding
and shallow layers, i.e., the first two stages. The remaining
layers comprise the substantially larger tails (roughly 25%
of total model parameters).
Then, we freeze the tail parameters and route the head
output to all non-corresponding tails (e.g., ConvNeXt-T to
Swin-T/S/B) and measure the accuracy every few iterations
with a batch size of 128 as we finetune the head parameters
using cross entropy loss. Each head-tail pair is a separate
model built by attaching a copy of the head from one ar-
chitecture to the tail of another. Where dimensions between
head and tail pairs do not match, we add a single 1×1
convolutional layer.
Fig. 12: Recovering Top-1 accuracy of rerouted heads
Figure 12 shows how rerouting the input between head models at first (0 iterations) results in near 0% accuracy across all head-tail pairs. However, the concatenated models quickly converge near their original accuracy (roughly 80-83%) within just a few iterations (10,100 iterations with 128 samples corresponding to one epoch on the ImageNet dataset). Notice that this holds regardless of whether the
head-tail pairs are directly related to the modified network.
Therefore, if a compressor can sufficiently approximate
the representation of just one head (i.e., the shallow layers of
a network), it should be possible to accommodate arbitrary
tails (i.e., the deeper layers of a network).
Crucially, applying the distortion measure in (10) or (11)
does not result in an inherently different encoder behav-
ior. Like training the compression model with a distortion
measure from LIC, the purpose of the encoder is reducing
uncertainty by decorrelating the data and discarding infor-
mation. The distortion measure only controls what informa-
tion an encoder should prioritize. Regardless of the target
backbone’s architecture, the encoder should decorrelate the
input to reduce uncertainty. Conversely, the decoder seeks a
mapping to the backbone’s representation.
In other words, if we can map the latent to one rep-
resentation, we can map it to any other with comparable
information content. We can freeze the encoder and train
various decoders to support arbitrary architectures once we
train one compression model with a particular teacher as
described in Figure 7. The blueprints facilitate an efficient
transformation from the encoder’s compressed representa-
tion to an input suitable for a particular backbone.
Notice that this method keeps the encoder parameters
frozen, permitting us to deploy a single set of weights across
all clients. Moreover, it does not modify the backbones
at any step. After deployment, splitting is replaced with
rerouting the input to a layer index (Section 5.3.2). Then,
we can serve clients with the same models regardless of
whether they applied the compressor.
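Reusing a trained encoder then amounts to freezing it (together with the entropy model) and fitting only a new decoder against the head output of the new backbone, as sketched below. The module names refer to the components sketched in the previous sections, and the learning rate is an assumption.

```python
import itertools
import torch

# Freeze the shared encoder and entropy model; train only the new decoder against
# the new backbone's (frozen) head output. Names refer to earlier sketches.
for p in itertools.chain(encoder.parameters(), entropy_bottleneck.parameters()):
    p.requires_grad_(False)

optimizer = torch.optim.Adam(new_decoder.parameters(), lr=1e-4)  # lr is an assumption

for x, _ in train_loader:
    with torch.no_grad():
        h = new_teacher_head(x)                    # shallow layers of the new backbone
        z_hat, _ = entropy_bottleneck(encoder(x))  # rate is fixed by the frozen encoder
    h_hat = new_decoder(z_hat)
    loss = torch.mean((h - h_hat) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```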
6 EVALUATION
6.1 Training & Implementation Details
We optimize our compression models initially on the 1.28
million ImageNet [41] training samples for 15 epochs, as
described in section 5.1 and section 5.2, with some slight
practical modifications for stable training. We aim to mini-
mize bitrate without sacrificing predictive strength. Hence,
we first seek the lowest β resulting in lossless prediction.
We use Adam optimization [58] with a batch size of 16 and start with an initial learning rate of 1·10⁻³, then gradually lower it to 1·10⁻⁶ with an exponential scheduler.
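For illustration, the schedule can be set up as below; the per-epoch decay granularity and the stand-in model are assumptions, not the exact implementation:

```python
import torch

model = torch.nn.Conv2d(3, 48, 3)  # stand-in for the compression model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epochs = 15
# Decay factor so that 1e-3 reaches 1e-6 after 15 epochs: (1e-6/1e-3)^(1/15) ~ 0.63
gamma = (1e-6 / 1e-3) ** (1.0 / epochs)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(epochs):
    # ... iterate over the 1.28M ImageNet training samples with batch size 16 ...
    scheduler.step()
```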
To implement our method, we use PyTorch [59], Com-
pressAI [60] for entropy estimation and entropy coding,
and pre-trained backbones from timm [57]. All baseline
implementations and weights were either taken from Com-
pressAI or the official repository of a baseline. To compute
the saliency maps, we use a modified XGradCAM method
from the library in [61] and include necessary patches in
our repository. Lastly, to ensure reproducibility, we use
torchdistill [62].
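As a sketch of how CompressAI's factorized entropy model serves both entropy estimation during training and entropy coding at deployment (channel count, latent size, and the 224×224 input assumed below are placeholders):

```python
import torch
from compressai.entropy_models import EntropyBottleneck

C = 48
entropy_bottleneck = EntropyBottleneck(C)
y = torch.randn(1, C, 28, 28)             # stand-in latent from the encoder

# Training: differentiable rate estimate from the likelihoods.
y_hat, y_likelihoods = entropy_bottleneck(y)
bits = (-torch.log2(y_likelihoods)).sum()
bpp = bits / (224 * 224)                   # normalize by input pixels (assumed 224x224)

# Deployment: build the CDF tables, then entropy-code to a byte string.
entropy_bottleneck.update()
strings = entropy_bottleneck.compress(y)
y_dec = entropy_bottleneck.decompress(strings, y.size()[-2:])
```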
6.2 Experiment Setting
The experiments reflect the deployment strategies illus-
trated in Figure 5 and Figure 4. Ultimately, we must eval-
uate whether FrankenSplit enables latency-sensitive and
performance-critical applications. Regardless of the particu-
lar task, a mobile edge client requires access to a DNN with
high predictive strength on a server. Therefore, we must
show whether FrankenSplit adequately solves two problems
associated with offloading high-dimensional image data for
real-time discriminative tasks. First, whether it considerably
reduces the bandwidth consumption compared to existing
methods without sacrificing predictive strength. Second,
whether it improves inference times over various communi-
cation channels, i.e., it must remain competitive even when
stronger connections are available.
Lastly, the evaluation should assess whether our method
generalizes to arbitrary backbones. However, since it is
infeasible to perform exhaustive experiments on all existing
visual models, we focus on three well-known representa-
tives and a subset of their variants instead. Namely, (i)
ResNet [53] for classic residual CNNs. (ii) Swin Trans-
former [55] for hierarchical vision transformers, which are receiving increasing adoption for a wide variety of vision
tasks. (iii) ConvNeXt [63] for modernized state-of-the-art
CNNs. Table 2 summarizes the relevant characteristics of
the unmodified backbones subject to our experiments.
TABLE 2: Overview of Backbone Performance on Server
Backbone Ratios Params Inference (ms) Top-1 Acc. (%)
Swin-T 2:2:6:2 28.33M 4.77 81.93
Swin-S 2:2:18:2 49.74M 8.95 83.46
Swin-B 2:2:30:2 71.13M 13.14 83.88
ConvNeXt-T 3:3:9:3 28.59M 5.12 82.70
ConvNeXt-S 3:3:27:3 50.22M 5.65 83.71
ConvNeXt-B 3:3:27:3 88.59M 6.09 84.43
ResNet-50 3:4:6:3 25.56M 5.17 80.10
ResNet-101 3:4:23:3 44.55M 10.17 81.91
ResNet-152 3:8:36:3 60.19M 15.18 82.54
6.2.1 Baselines
Since our work aligns closest to learned image compression,
we extensively compare FrankenSplit with learned and
handcrafted codecs applied to the input images, i.e., the
input to the backbone is the distorted output. Comparing
task-specific methods to general-purpose image compres-
sion methods may seem unfair. However, FrankenSplit’s
universal encoder has up to 260x fewer trainable parameters
and further reduces overhead by not including side infor-
mation or a sequential context model.
The naming convention for the learned baselines is the
first author’s name, followed by the entropy model. Specif-
ically, we choose the work by Ballé et al. [17], [18] and
Minnen et al. [19] for LIC methods since they represent
foundational milestones. Complementary, we include the
work by Cheng et al. [64] to demonstrate improvements
with architectural enhancement.
As the representative for disregarding autoencoder size
to achieve state-of-the-art r-d performance in LIC, we chose
the work by Chen et al. [65]. Their method differs from other
LIC baselines by using a partially parallelizable context
model, which trades off compression rate with execution
time according to the configurable block size. We refer to
such context models as Blocked Joint Hierarchical Priors and
Autoregressive (BJHAP). Due to the large autoencoder, we
found evaluating the inference time on constrained devices
impractical when the context model is purely sequential and
set the block size to 64x64. Additionally, we include the
work by Lu et al. [66] as a milestone of the recent effort
on efficient LIC with reduced autoencoders but only for
latency-related experiments since we do not have access to
the trained weights.
As a baseline for the state-of-the-art SC, we include
the Entropic Student (ES) [22], [23]. The ES demonstrates
the performance of directly applying a minimally adjusted
LIC method for feature compression. One caveat is that we
intend to show how FrankenSplit generalizes beyond CNN
backbones, despite the encoder’s simplistic CNN architec-
ture. Although Matsubara et al. evaluate the ES on a wide
range of backbones, most have no lossless configurations.
Nevertheless, comparing bottleneck injection methods using
different backbones is fair, as we found that the choice does
not significantly impact the r-d performance (Section 6.3.5).
Therefore, for an intuitive comparison, we choose ES with
ResNet-50 using the same factorized prior entropy model as
FrankenSplit.
We separate the experiments into two categories to as-
sess whether our proposed method addresses the above-
mentioned problems.
6.2.2 Criteria for rate-distortion performance
We measure the bitrate in bits per pixel (bpp) because it per-
mits directly comparing models with different input sizes.
Choosing a distortion measure to draw meaningful and
honest comparisons is challenging for feature compression.
Unlike evaluating reconstruction fidelity for image com-
pression, PSNR or MS-SSIM does not provide intuitive
results regarding predictive strength. Similarly, reporting
absolute values (e.g., top-1 accuracy) gives an unfair ad-
vantage to experiments conducted on higher capacity back-
bones and veils the efficacy of a proposed method.
Hence, for a transparent evaluation, we determine the
adversarial effects of codecs with image classification since
it provides an unambiguous performance metric with es-
tablished benchmark datasets. Specifically, we evaluate the
distortion with the relative measure predictive loss, i.e., the
drop in top-1 accuracy incurred by codecs. In particular,
for SVBI methods, (near) lossless prediction implies that
the reconstruction is a sufficient approximation for shallow
features of an arbitrary feature extractor.
To ensure a fair comparison, we give the LIC and hand-
crafted baselines a grace threshold of 1.0% top-1 accuracy to account for the possibility of mitigating the predictive loss incurred by codec
artifacts [67]. For FrankenSplit, we set the threshold at 0.4%,
reflecting the configuration with the lowest predictive loss
of the ES. Note that, unlike the ES, FrankenSplit does not
rely on finetuning the tail parameters of a backbone to
improve r-d performance.
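Both measures are straightforward to compute; the sketch below states the two definitions (file path and accuracy values are purely illustrative):

```python
import os

def bits_per_pixel(binary_path: str, height: int = 224, width: int = 224) -> float:
    """Bitrate of an entropy-coded binary, normalized by the input pixels."""
    return os.path.getsize(binary_path) * 8 / (height * width)

def predictive_loss(baseline_top1: float, codec_top1: float) -> float:
    """Drop in top-1 accuracy (percentage points) incurred by a codec."""
    return baseline_top1 - codec_top1

# Example: a Swin-B backbone at 83.88% top-1 paired with a codec reaching
# 83.48% incurs a 0.40% predictive loss, i.e., exactly the FrankenSplit threshold.
assert abs(predictive_loss(83.88, 83.48) - 0.40) < 1e-9
```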
6.2.3 Measuring latency and overhead
To account for the resource asymmetry in MEC, we
use NVIDIA Jetson boards² to represent capable but resource-constrained mobile clients. In contrast, the
server hosts a powerful GPU. Table 3 summarizes the hard-
ware we use in our experiments.
TABLE 3: Clients and Server Hardware Configuration
Device Arch CPU GPU
Server x86 16x Ryzen @ 3.4 GHz RTX 3090
Client (TX2) arm64x8 4x Cortex @ 2 GHz Pascal 256 CC
Client (NX) arm64x8 4x Cortex @ 2 GHz Volta 48 TC
6.3 Rate-Distortion Performance
We measure the predictive loss by the drop in top-1 accuracy
from Table 2 using the ImageNet validation set for the
standard classification task with 1000 categories. Analo-
gously, we measure filesizes of the entropy-coded binaries
to calculate the average bpp. To demonstrate that we can
accommodate a non-CNN backbone with a CNN encoder,
we start with a Swin-B implementation of FrankenSplit.
Figure 13 shows r-d curves with the Swin-B backbone. The
architectures of FrankenSplit-FP (FS-FP) and FrankenSplit-SGFP (FS-SGFP) are identical. We train both models with the loss functions derived in Section 5.1. The difference is that FS-SGFP is saliency-guided, i.e., FS-FP represents the pure HD training method and serves as an ablation of the saliency-guided distortion.
6.3.1 Effect of Saliency Guidance
Although FS-FP performs better than almost all other mod-
els, it is trained with the suboptimal objective discussed
in Section 4.3. We identified the issue as needlessly skewing the objective towards the distortion term. Con-
sequently, we proposed regularizing the distortion term by
applying extracted saliency maps in Section 5.2 to improve
the r-d performance. We favor Grad-CAM to compute the
saliency maps over comparable methods for two reasons.
First, it is generically applicable to arbitrary vision models.
Second, it does not introduce additional tunable hyper-
parameters. The suboptimality of the unregularized objec-
tive is demonstrated by FS-SGFP outperforming FS-FP. By
2. nvidia.com/en-gb/autonomous-machines/embedded-systems/
simply guiding the distortion loss with saliency maps, we
achieve a 25% lower bitrate without impacting predictive
strength or additional runtime overhead.
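A simplified sketch of saliency-guided distortion is shown below; it relies on the library's stock XGradCAM and a stock torchvision ResNet-50 purely for illustration, whereas our implementation uses a modified, patched XGradCAM and the actual teacher's shallow features:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from pytorch_grad_cam import XGradCAM

teacher = resnet50(weights="IMAGENET1K_V1").eval()
cam = XGradCAM(model=teacher, target_layers=[teacher.layer4[-1]])

def saliency_weighted_distortion(h_student, h_teacher, images):
    """Weight the per-location squared error between the decoder output and
    the teacher's shallow features by a Grad-CAM saliency map."""
    saliency = torch.from_numpy(cam(input_tensor=images)).float()   # (B, H, W) in [0, 1]
    saliency = F.interpolate(saliency.unsqueeze(1),
                             size=h_teacher.shape[-2:],
                             mode="bilinear", align_corners=False)  # match feature map size
    return (saliency * (h_student - h_teacher) ** 2).mean()
```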
Fig. 13: Rate-distortion curve for ImageNet
6.3.2 Comparison to the ES
Even without saliency guidance, FS-FP consistently outper-
forms ES by a large margin. Specifically, FS-FP and FS-
SGFP achieve 32% and 63% lower bitrates for the lossless
configuration.
We ensured that our bottleneck injection incurs compa-
rable overhead for a direct comparison to the ES. Moreover,
the ES has an advantage due to finetuning tail parameters
in an auxiliary training stage. Therefore, we attribute the
performance gain to the more sophisticated architectural
design decisions described in Section 5.3.
6.3.3 Comparison to Image Codecs
For almost all lossy codec baselines, Figure 13 illustrates
that FS-(SG)FP has a significantly better r-d performance.
Comparing FS-FP to Ballé-FP demonstrates the r-d gain
of task-specific compression over general-purpose image
compression. Although the encoder of FrankenSplit has 25x
fewer parameters, both codecs use an FP entropy model
with encoders consisting of convolutional layers. Yet, the
average file size of FS-FP with a predictive loss of around
5% is 7x less than the average file size of Ballé-FP with
comparable predictive loss.
FrankenSplit also beats modern general-purpose LIC
without including any of their heavy-weight components.
The only baseline FrankenSplit does not convincingly out-
perform is Chen-BJHAP. Nevertheless, in Section 3.4, we
demonstrate that the incurred overhead offsets the compres-
sion gain disproportionately.
6.3.4 Image Codec Incurred Predictive Loss
For clarity, we separately evaluate r-d performance on the
other backbones listed in Table 2 for FrankenSplit and
baseline codecs.
Earlier, we argued that measuring PSNR is unsuitable
to assess effects on downstream prediction. Since the image
codecs are entirely decoupled from the predictive task, the
bitrate is identical regardless of the backbone. We use this
opportunity to plot PSNR instead of bpp against predictive
loss in Figure 14.
Considering that compression models aggressively dis-
card information, it is intuitive that the predictive loss is
comparable across backbones. While some models handle
distorted samples better, the difference in predictive loss is
at most 3-5%. Still, the discrepancy demonstrates that PSNR
is not a suitable measure for downstream tasks even within
the same codec. More importantly, the discrepancy across
baselines is considerably wider. For example, it is around
10% between Minnen-MSHP and Chen-BJHAP for lower
PSNR levels.
Fig. 14: Predictive Loss of baselines on multiple Backbones (panels: Ballé-FP, Ballé-SHP, Minnen-MSHP, Minnen-JHAP, Cheng-JHAP, and Chen-BJHAP; Pred. Loss [%] vs. PSNR [dB], one curve per backbone)
6.3.5 Blueprints Generalization to Arbitrary Backbones
We now evaluate the r-d performance of other implementa-
tions of FrankenSplit to determine whether the blueprint
heuristics generalize to arbitrary architectures. We create
a decoder blueprint (Section 5.3.3) for each of the three
architectural families (Swin, ResNet, and ConvNeXt). Then,
we perform bottleneck injection at the layers before the
deepest stage (Section 5.3.2). Figure 15 plots the r-d performance
of directly related architectures sharing the correspond-
ing blueprint but with separately trained compressors. All
models are trained as described in Figure 7. Across all
architectural families, we observe similar r-d performance.
The (near) lossless configurations of the largest backbones
(Swin-B, ConvNeXt-B, ResNet-152) require around the same
bpp, whereas smaller models tend to require 3-4% more bpp
for comparable predictive loss.
Next, we conduct experiments to determine the importance of finding an adequate blueprint by deliberately assigning mismatched blueprints to a backbone. Table 4 summarizes the
Fig. 15: Rate-distortion curve for various backbones (panels: ConvNeXts, ResNets, and Swins, without and with saliency guidance; Pred. Loss [%] vs. bits per pixel, one curve per backbone/teacher)
results for the largest backbones with varying decoder sizes.
The Swin blueprint for the Swin-B backbone corresponds to the FS-FP implementation from Figure 13.
With 1% overhead in parameters, the compressor achieves
5.08 kB for 0.40% predictive loss. However, once we train
compressors with ResNet or ConvNeXt restoration blocks,
the r-d performance for the Swin-B is significantly worse
when overhead is roughly equal. A blueprint that performs
TABLE 4: Effect of Mismatching Blueprints
Blueprint-Backbone Params Overhead (%) File Size (kB) Pred. Loss (%)
ConvNeXt-Swin 0.96 19.05 2.49
ConvNeXt-Swin 0.96 15.07 3.00
ConvNeXt-Swin 2.86 14.46 2.53
ConvNeXt-Swin 2.86 11.37 2.74
ConvNeXt-Swin 5.89 12.53 1.36
ConvNeXt-Swin 5.89 10.08 2.09
ResNet-Swin 1.03 22.54 0.82
ResNet-Swin 1.03 18.19 0.99
ResNet-Swin 2.73 16.32 0.81
ResNet-Swin 2.73 12.68 0.98
ResNet-Swin 5.25 13.89 0.79
ResNet-Swin 5.25 10.01 0.98
well for its intended target architecture results in substan-
tially worse r-d performance for other architectures. Only in-
creasing the decoder size brings the r-d performance closer
to configurations that apply the appropriate blueprint.
From our findings, we can draw several conclusions. The r-d performance is nearly agnostic to the backbone network. The implication is that the information content of the teachers (i.e., shallow layers) of varying architectures is comparable. We explain this by considering that we select the shallow layers as all layers preceding the deepest stage, which have comparable parameter counts across varying architectures.
Additionally, a decoder architecture with the correct inductive bias (i.e., a matching blueprint) transforms the compressed features significantly more efficiently.
6.3.6 Single Encoder with Multiple Backbones
We conduct a similar experiment as head rerouting from
Section 5.3.4. However, we finetune the decoders instead of
the head models.
We first select the compressors with (near) lossless
prediction from Figure 15 for each architectural family.
Then, we choose the encoder from one of the compressors
corresponding to the largest variants. Finally, we attach the
decoders from the other compressors and finetune their
parameters. We use unweighted head distillation and cross
entropy (between the backbone classifier outputs and the
hard labels) as the loss function. Analogous to the experi-
ment in Section 5.3.4, we set the batch size to 128 and use PyTorch's Adam optimizer with a learning rate of 7·10⁻⁵.
Fig. 16: Iterations to recover accuracy with decoder (panels: teacher encoders ConvNeXt-B, ResNet-152, and Swin-B; Top-1 Acc [%] over finetuning iterations for each decoder-backbone pair)
To demonstrate the limited importance of the initial teacher, we
repeat this process for each of the three encoders separately
and summarize the results in Figure 16. Note that the bitrate
does not change due to freezing the encoder parameters.
Hence, we report iterations until accuracy is restored to
exemplify the similarity to the rerouting experiment in
Section 5.3.4. We consider the accuracy restored if it is within 0.25 ± 0.25% of the original accuracy.
Besides requiring more iterations for convergence, the
results are unsurprisingly similar to the head routing ex-
periment outlined in Figure 12. Since we can infer from the
earlier results that decoders can sufficiently approximate the
head output, finetuning the decoder is near-equivalent to
finetuning a head.
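A sketch of a single finetuning step is shown below; module names are placeholders, only the decoder receives gradients, and the loss follows the described combination of unweighted head distillation and cross-entropy:

```python
import torch
import torch.nn.functional as F

def decoder_finetune_step(encoder, decoder, tail_and_classifier, teacher_head,
                          images, labels, optimizer):
    """Adapt a decoder to the frozen, shared encoder: MSE against the
    teacher's shallow features plus cross-entropy on the classifier output."""
    with torch.no_grad():
        z = encoder(images)               # frozen encoder, so the bitrate is unchanged
        h_teacher = teacher_head(images)  # shallow features the decoder must approximate
    h = decoder(z)
    logits = tail_and_classifier(h)
    loss = F.mse_loss(h, h_teacher) + F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# In the experiment: batch size 128 and Adam over the decoder parameters only,
# e.g., optimizer = torch.optim.Adam(decoder.parameters(), lr=7e-5)
```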
6.3.7 Generalization to multiple Downstream Tasks
Arguably, SVBI naturally generalizes to multiple down-
stream tasks due to approximating shallow features. We pro-
vide empirical evidence by evaluating the r-d performance
of the compressors from Figure 13 without retraining the
weights on different datasets.
Specifically, we attach separate classifiers to the Swin-B backbone (as illustrated in Figure 10). Using PyTorch's Adam optimizer, we train each classifier for five epochs with no augmentation and a learning rate of 5·10⁻⁵. A classifier refers to the last layers of a network.
For FrankenSplit-(SG)FP, we applied no or only rudimentary augmentation to evaluate how our method handles a type of noise it did not encounter during training. Hence, we include the Food-101 [68] dataset since it contains noise in high pixel intensities. Additionally, we include CIFAR-100 [42]. Lastly, we include the Flower-102 [69] dataset to contrast with more challenging tasks. The classifiers achieve 87.73%, 88.01%, and 89.00% top-1 accuracy,
respectively. Figure 17 summarizes the r-d curves for each
task. Our method still demonstrates clear r-d performance
gains over the baselines. More importantly, notice how FS-
SGFP outperforms FS-FP on the r-d curve for the Food-
101 dataset, with a comparable margin to the ImageNet
Fig. 17: Rate-distortion curve for multiple downstream tasks (panels: CIFAR-100, Flower-102, and Food-101; Pred. Loss [%] vs. bits per pixel for WebP, JPEG, Ballé-FP, Ballé-SHP, Chen-BJHAP, Cheng-JHAP, FrankenSplit-FP, FrankenSplit-SGFP, Minnen-JHAP, and Minnen-MSHP)
dataset. Conversely, on the Flower-102 dataset, there is less performance difference. Presumably, on simple datasets, the suboptimality of HD is less significant. Considering how easier tasks require less model capacity, the diminishing efficacy of saliency guidance is consistent with our claims from Section 4.
6.3.8 Effect of Tensor Dimensionality on R-D Performance
Section 3.3 argues that measuring tensor dimensionality is
inadequate to assess whether partial execution on the client
is worthwhile.
To verify, we implement and train additional instances
of FrankenSplit with the Swin-B backbone and show re-
sults in Figure 18.
Fig. 18: Comparing effects on sizes (left: Predictive Loss [%] vs. bits per pixel for FS-SGFP S/M/L; right: encoded latent dimension vs. file size in kB)
FS-SGFP (S) is the model with a small
encoder (140,000 parameters) we have used for our previous results. FS-SGFP (M) and FS-SGFP (L) are medium and large models where we increased the (output) channels from C = 48 to 96 and 128, respectively. Besides the number of channels, we trained the medium and large models using the same configurations. On the left, we plot the r-d curves showing that increasing encoder capacity naturally results in lower bitrates without additional predictive loss. For the plot on the right, we train further models with C = {48, 64, 96, 108, 120, 128} using the configuration resulting in lossless prediction. Notice how increasing the output channels results in higher-dimensional latent tensors (C × 28 × 28) but inversely correlates with the compressed file size. Arguably, increasing the encoder capacity yields more powerful transforms to decorrelate the input.
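The following sketch contrasts the two quantities with CompressAI's factorized entropy model; the random stand-in latents make the printed byte counts illustrative only, but they separate what grows with C (the number of latent elements) from what is actually transmitted (the entropy-coded payload, which depends on the latent's entropy under the learned prior rather than on its dimensionality):

```python
import torch
from compressai.entropy_models import EntropyBottleneck

for C in (48, 96, 128):
    eb = EntropyBottleneck(C)
    eb.update()                          # build CDF tables for entropy coding
    y = torch.randn(1, C, 28, 28) * 0.1  # stand-in latent; real latents come from the trained encoder
    strings = eb.compress(y)
    dims = y.numel()                     # latent dimensionality: grows linearly with C
    payload = len(strings[0])            # transmitted bytes after entropy coding
    print(f"C={C}: latent elements={dims}, compressed payload={payload} bytes")
```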
Fig. 19: Contrasting the r-d performance (panels: CIFAR-10 and ImageNet; Predictive Loss [%] vs. bits per pixel for WebP, JPEG, SVBI-CE, and SVBI-KD)
6.3.9 The Limitations of Direct Optimization for SVBI
Section 5.1 mentioned that direct optimization does not
work for SVBI as it does for DVBI, where the bottleneck is at
the penultimate layer. Specifically, it performs incomparably
worse than HD despite the latter’s inherent suboptimality.
We demonstrate this by applying the SVBI-CE and SVBI-KD objectives on the CIFAR-10 [42] and ImageNet datasets.
All models are identical and trained with the setup in
Section 6.1, except we train for more epochs to account for
slower convergence.
Figure 19 summarizes the results. On CIFAR-10, SVBI-CE and SVBI-KD yield a moderate performance
gain over JPEG. Yet, they perform substantially worse on
ImageNet.
Sufficiency as a necessary precondition may explain why
the objective in (5) does not yield good results when the
bottleneck is at a shallow layer, as the mutual information I(Y; Ỹ) is not adequately high. Since the representations of the last hidden layer and a shallow layer are so far apart in the information path, there is insufficient information to minimize D(H; H̃). For a simple classification task, the compression model can approximate the intermediate representation well enough to minimize predictive loss, albeit by incurring higher bitrates.
Consequently, for the challenging ImageNet classification
task, the same method incurs significant predictive loss
even when skewing the r-d objective heavily towards high
bitrates.
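For reference, the two direct-optimization distortion terms can be sketched as follows; the weighting factor and distillation temperature are assumptions, and rate stands for the bitrate term of the respective entropy model:

```python
import torch.nn.functional as F

def svbi_ce_objective(rate, student_logits, labels, beta=1.0):
    """Direct optimization with cross-entropy against hard labels (SVBI-CE)."""
    return rate + beta * F.cross_entropy(student_logits, labels)

def svbi_kd_objective(rate, student_logits, teacher_logits, beta=1.0, T=4.0):
    """Direct optimization with knowledge distillation (SVBI-KD): the student
    matches the teacher's softened output distribution instead of hard labels."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return rate + beta * kd
```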
6.4 Prediction Latency and Overhead
We exclude entropy coding from our measurement, since
not all baselines use the same entropy coder. For brevity,
the results implicitly assume the Swin-B backbone for
the remainder of this section. Inference times with other
backbones for FrankenSplit can be derived from Table 5.
Analogously, the inference times of applying LIC models
TABLE 5: Execution Times of FS (S) with Various Backbones
Backbone Overhead Params (%) Inf. Server+NX (ms) Inf. Server+TX2 (ms)
Swin-T 2.51 7.83 9.75
Swin-S 1.41 11.99 13.91
Swin-B 1.00 16.12 18.04
ConvNeXt-T 3.46 6.83 8.75
ConvNeXt-S 1.97 8.50 10.41
ConvNeXt-B 0.90 9.70 11.62
ResNet-50 3.50 13.16 10.05
ResNet-101 2.01 8.13 15.08
ResNet-152 1.48 18.86 20.78
for different unmodified backbones can be derived using
Table 2. Notably, the relative overhead decreases the larger
the tail is, which is favorable since we target inference from
more accurate predictors.
6.4.1 Computational Overhead
We first disregard network conditions to get an overview of
the computational overhead of applying compression mod-
els. Table 6 summarizes the execution times of the prediction
pipeline’s components. Enc. NX/TX2 refers to the encoding
TABLE 6: Inference Pipeline Components Execution Times
Model Params Enc./Dec. Enc. [NX/TX2] (ms) Dec. (ms) Full [NX/TX2] (ms)
FrankenSplit 0.14M/2.06M 2.92/4.87 2.00 16.34/18.29
Ballé-FP 3.51M/3.51M 27.27/48.93 1.30 41.71/63.37
Ballé-SHP 8.30M/5.90M 28.16/50.89 1.51 42.81/65.54
Minnen-MSHP 14.04M/11.65M 29.51/52.39 1.52 44.17/67.05
Minnen-JHAP 21.99M/19.59M 4128.17/4789.89 275.18 4416.7/5078.2
Cheng-JHAP 16.35M/22.27M 2167.34/4153.95 277.26 2457.7/4444.3
Lu-JHAP 5.28M/4.37M 2090.88/5011.56 352.85 2456.8/5377.8
Chen-BJHAP 36.73M/28.08M 3111.01/5837.38 43.16 3167.3/5893.6
time on the respective client device. Analogously, dec. refers
to the decoding time at the server. Lastly, Full NX/TX2 is
the total execution time of encoding at the respective client
plus decoding and the prediction task at the server. Lu-
JHAP demonstrates how LIC models without a sequential
context component are noticeably faster but are still 9.3x-
9.6x slower than FrankenSplit despite a considerably worse
r-d performance. Notice that the computational load of
FrankenSplit is near evenly distributed between the client
and the server. The significance of considering resource
asymmetry is emphasized by how the partially parallelized
context model of Chen-BJHAP leads to faster decoding
on the server. Nevertheless, it is slower than other JHAP
baselines due to the overhead of the increased encoder size
outweighing the performance gain of the blocked context
model on constrained hardware.
6.4.2 Competing against Offloading
The average compressed file size over the ImageNet validation set gives the transfer size. Using the transfer size, we
TABLE 7: Total Latency with Various Wireless Standards
Standard/Data Rate (Mbps) Codec Transfer (ms) Total [TX2] (ms) Total [NX] (ms)
BLE/0.27 FS-SGFP (0.23) 142.59 160.48 158.53
BLE/0.27 FS-SGFP (LL) 209.89 227.78 225.83
BLE/0.27 Minnen-MSHP 348.85 415.89 393.01
BLE/0.27 Chen-BJHAP 40.0 6167.79 3441.41
BLE/0.27 WebP 865.92 879.06 879.06
BLE/0.27 PNG 2532.58 2545.72 2545.72
4G/12.0 FS-SGFP (0.23) 3.21 21.09 19.15
4G/12.0 FS-SGFP (LL) 4.72 22.61 20.66
4G/12.0 Minnen-MSHP 7.85 74.89 52.01
4G/12.0 Chen-BJHAP 0.9 6128.69 3402.31
4G/12.0 WebP 19.48 32.63 32.63
4G/12.0 PNG 56.98 70.13 70.13
Wi-Fi/54.0 FS-SGFP (0.23) 0.71 18.6 16.65
Wi-Fi/54.0 FS-SGFP (LL) 1.05 18.93 16.99
Wi-Fi/54.0 Minnen-MSHP 1.74 68.78 45.9
Wi-Fi/54.0 Chen-BJHAP 0.2 6127.99 3401.61
Wi-Fi/54.0 WebP 4.33 17.47 17.47
Wi-Fi/54.0 PNG 12.66 25.81 25.81
5G/66.9 FS-SGFP (0.23) 0.58 18.46 16.51
5G/66.9 FS-SGFP (LL) 0.85 18.73 16.78
5G/66.9 Minnen-MSHP 1.41 68.44 45.56
5G/66.9 Chen-BJHAP 0.16 6127.95 3401.57
5G/66.9 WebP 3.49 16.64 16.64
5G/66.9 PNG 10.22 23.36 23.36
evaluate transfer time on a broad range of standards. Since
we did not include the execution time of entropy coding
for learned methods, the encoding and decoding time for
the handcrafted codecs is set to 0. The setting favors the
baselines because both rely on sequential CPU-bound trans-
forms. Table 7 summarizes how our method performs in
various standards. Due to space constraints, we only include
LIC models with the lowest request latency (Minnen-MSHP) or the lowest bitrate (Chen-BJHAP). Still, with
Table 6 and the previous results, we can infer that the LIC
baselines have considerably higher latency than Franken-
Split.
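The totals in Table 7 follow a simple additive latency model; the sketch below illustrates it with rounded component times in the spirit of, but not identical to, the measured entries:

```python
def total_latency_ms(file_size_kb: float, data_rate_mbps: float,
                     encode_ms: float, decode_ms: float, inference_ms: float) -> float:
    """Client-side encoding + transfer of the compressed payload + server-side
    decoding and prediction."""
    transfer_ms = (file_size_kb * 8.0) / (data_rate_mbps * 1000.0) * 1000.0
    return encode_ms + transfer_ms + decode_ms + inference_ms

# Illustrative example on a 12 Mbps (4G-like) link: ~4.8 kB payload,
# 2.92 ms encoding on the NX, 2.00 ms decoding, 13.14 ms Swin-B inference.
print(total_latency_ms(4.8, 12.0, 2.92, 2.00, 13.14))  # ~21.3 ms
```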
Generally, the more constrained the network is, the more we can benefit from reducing the transfer size. In particu-
lar, FrankenSplit is up to 16x faster in highly constrained
networks, such as BLE. Conversely, offloading with fast
handcrafted codecs may be preferable in high-bandwidth
environments. Yet, FrankenSplit is significantly better than
offloading with PNG, even for 5G. Figure 20 plots the
inference latencies against handcrafted codecs using the
NX client. For stronger connections, such as 4G LTE, it is
still 3.3x faster than using PNG. Nevertheless, compared to
WebP, offloading seems more favorable when bandwidth
is high. Still, this assumes that the rates do not fluctuate
and that the network can seamlessly scale for an arbitrary
number of client connections. Moreover, we did not apply
any optimizations to the encoder.
7 CONCLUSION
This work introduced a novel lightweight compression
framework to facilitate critical MEC applications relying on
large DNNs. We showed that a minimalistic implementation
Fig. 20: Total latency of FrankenSplit against offloading with handcrafted codecs over varying data rates (panels: FS-SGFP (LL) and FS-SGFP (-0.40) vs. PNG and WebP on the NX client)
of our design heuristic is sufficient to outperform numer-
ous baselines. However, there are several limitations. We
emphasize that the primary insight of the reported results
is the potential of adequate distortion measures and regu-
larization methods for neural feature compression. Despite
significantly improving rate-distortion performance, better
methods may exist to extract saliency maps. Moreover,
the Factorized Prior entropy model does not discriminate
between inputs. Although side information with hyperprior networks taken from LIC trivially improves rate-distortion per-
formance, our results show that it may not be a productive
approach to repurpose existing image compression methods
directly. Hence, conceiving an efficient way to include task-
dependent side information is a promising direction.
REFERENCES
[1] A. Voulodimos, N. Doulamis, A. Doulamis, E. Protopapadakis,
et al., “Deep learning for computer vision: A brief review,” Com-
putational intelligence and neuroscience, vol. 2018, 2018.
[2] D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the
usages of deep learning for natural language processing,” IEEE
transactions on neural networks and learning systems, vol. 32, no. 2,
pp. 604–624, 2020.
[3] C. Feng, P. Han, X. Zhang, B. Yang, Y. Liu, and L. Guo, “Compu-
tation offloading in mobile edge computing networks: A survey,”
Journal of Network and Computer Applications, p. 103366, 2022.
[4] T. Rausch, W. Hummer, C. Stippel, S. Vasiljevic, C. Elvezio,
S. Dustdar, and K. Krösl, “Towards a platform for smart city-
scale cognitive assistance applications,” in 2021 IEEE Conference
on Virtual Reality and 3D User Interfaces Abstracts and Workshops
(VRW), pp. 330–335, 2021.
[5] R. R. Arinta and E. Andi W.R., “Natural disaster application on
big data and machine learning: A review,” in 2019 4th International
Conference on Information Technology, Information Systems and Elec-
trical Engineering (ICITISEE), pp. 249–254, 2019.
[6] Q. Xin, M. Alazab, V. G. Díaz, C. E. Montenegro-Marin, and R. G.
Crespo, “A deep learning architecture for power management in
smart cities,” Energy Reports, vol. 8, pp. 1568–1577, 2022.
[7] U. Cisco, “Cisco annual internet report (2018–2023) white paper,”
Cisco: San Jose, CA, USA, vol. 10, no. 1, pp. 1–35, 2020.
[8] Q. Zhang, X. Li, X. Che, X. Ma, A. Zhou, M. Xu, S. Wang,
Y. Ma, and X. Liu, “A comprehensive benchmark of deep learn-
ing libraries on mobile devices,” in Proceedings of the ACM Web
Conference 2022, pp. 3298–3307, 2022.
[9] Q. Zhang, X. Che, Y. Chen, X. Ma, M. Xu, S. Dustdar, X. Liu, and
S. Wang, “A comprehensive deep learning library benchmark and
optimal library selection,” IEEE Transactions on Mobile Computing,
2023.
[10] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “INFaaS:
Automated model-less inference serving,” in 2021 USENIX An-
nual Technical Conference (USENIX ATC 21), pp. 397–411, USENIX
Association, July 2021.
[11] Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and
early exiting for deep learning applications: Survey and research
challenges,” ACM Computing Surveys, vol. 55, no. 5, pp. 1–30, 2022.
[12] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck
method,” 2000.
[13] Y. Yang, S. Mandt, and L. Theis, “An introduction to neural data
compression,” 2022.
[14] G. LLC, “An image format for the web.”
[15] V. Goyal, “Theoretical foundations of transform coding,” IEEE
Signal Processing Magazine, vol. 18, no. 5, pp. 9–21, 2001.
[16] J. Ballé, P. A. Chou, D. Minnen, S. Singh, N. Johnston, E. Agusts-
son, S. J. Hwang, and G. Toderici, “Nonlinear transform coding,”
CoRR, vol. abs/2007.03034, 2020.
[17] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized
image compression,” in 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Con-
ference Track Proceedings, OpenReview.net, 2017.
[18] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston,
“Variational image compression with a scale hyperprior,” 2018.
[19] D. Minnen, J. Ballé, and G. Toderici, “Joint autoregressive and
hierarchical priors for learned image compression,” 2018.
[20] S. Singh, S. Abu-El-Haija, N. Johnston, J. Ballé, A. Shrivastava, and
G. Toderici, “End-to-end learning of compressible features,” CoRR,
vol. abs/2007.11797, 2020.
[21] Y. Dubois, B. Bloem-Reddy, K. Ullrich, and C. J. Maddison, “Lossy
compression for lossless prediction,” Advances in Neural Informa-
tion Processing Systems, vol. 34, pp. 14014–14028, 2021.
[22] Y. Matsubara, R. Yang, M. Levorato, and S. Mandt, “SC2 Bench-
mark: Supervised Compression for Split Computing,” Transactions
on Machine Learning Research, 2023.
[23] Y. Matsubara, R. Yang, M. Levorato, and S. Mandt, “Supervised
compression for resource-constrained edge computing systems,”
in Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, pp. 2685–2695, 2022.
[24] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and
L. Tang, “Neurosurgeon: Collaborative intelligence between the
cloud and mobile edge,” ACM SIGARCH Computer Architecture
News, vol. 45, no. 1, pp. 615–629, 2017.
[25] H. Li, C. Hu, J. Jiang, Z. Wang, Y. Wen, and W. Zhu, “Jalad:
Joint accuracy-and latency-aware deep structure decoupling for
edge-cloud execution,” in 2018 IEEE 24th international conference on
parallel and distributed systems (ICPADS), pp. 671–678, IEEE, 2018.
[26] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D.
Lane, “Spinn: synergistic progressive inference of neural networks
over device and cloud,” in Proceedings of the 26th annual interna-
tional conference on mobile computing and networking, pp. 1–15, 2020.
[27] M. Almeida, S. Laskaridis, S. I. Venieris, I. Leontiadis, and N. D.
Lane, “Dyno: Dynamic onloading of deep neural networks from
cloud to device,” ACM Transactions on Embedded Computing Sys-
tems, vol. 21, no. 6, pp. 1–24, 2022.
[28] H. Liu, W. Zheng, L. Li, and M. Guo, “Loadpart: Load-aware
dynamic partition of deep neural networks for edge offloading,”
in 2022 IEEE 42nd International Conference on Distributed Computing
Systems (ICDCS), pp. 481–491, 2022.
[29] A. Bakhtiarnia, N. Milošević, Q. Zhang, D. Bajović, and A. Iosi-
fidis, “Dynamic split computing for efficient deep edge intelli-
gence,” in ICASSP 2023 - 2023 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023.
[30] A. E. Eshratifar, A. Esmaili, and M. Pedram, “Bottlenet: A deep
learning architecture for intelligent mobile cloud computing ser-
vices,” in 2019 IEEE/ACM International Symposium on Low Power
Electronics and Design (ISLPED), pp. 1–6, 2019.
[31] J. Shao and J. Zhang, “Bottlenet++: An end-to-end approach for
feature compression in device-edge co-inference systems,” in 2020
IEEE International Conference on Communications Workshops (ICC
Workshops), pp. 1–6, 2020.
[32] Y. Matsubara, S. Baidya, D. Callegaro, M. Levorato, and S. Singh,
“Distilled split deep neural networks for edge-assisted real-time
systems,” in Proceedings of the 2019 Workshop on Hot Topics in Video
Analytics and Intelligent Edges, HotEdgeVideo’19, (New York, NY,
USA), p. 21–26, Association for Computing Machinery, 2019.
[33] M. Sbai, M. R. U. Saputra, N. Trigoni, and A. Markham, “Cut, distil
and encode (cde): Split cloud-edge deep inference,” in 2021 18th
Annual IEEE International Conference on Sensing, Communication,
and Networking (SECON), pp. 1–9, 2021.
[34] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, “Model compression
and hardware acceleration for neural networks: A comprehensive
survey,” Proceedings of the IEEE, vol. 108, no. 4, pp. 485–532, 2020.
[35] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convo-
lutional neural networks: analysis, applications, and prospects,”
IEEE transactions on neural networks and learning systems, 2021.
[36] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang,
A. Xiao, C. Xu, Y. Xu, et al., “A survey on vision transformer,”
IEEE transactions on pattern analysis and machine intelligence, vol. 45,
no. 1, pp. 87–110, 2022.
[37] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “Infaas:
Automated model-less inference serving.,” in USENIX Annual
Technical Conference, pp. 397–411, 2021.
[38] K. Zhao, Z. Zhou, X. Chen, R. Zhou, X. Zhang, S. Yu, and D. Wu,
“Edgeadaptor: Online configuration adaption, model selection
and resource provisioning for edge dnn inference serving at scale,”
IEEE Transactions on Mobile Computing, pp. 1–16, 2022.
[39] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in Interna-
tional conference on machine learning, pp. 448–456, pmlr, 2015.
[40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated
residual transformations for deep neural networks,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
pp. 1492–1500, 2017.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet
large scale visual recognition challenge,” International journal of
computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[42] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of
features from tiny images,” 2009.
[43] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep
neural networks via information,” arXiv preprint arXiv:1703.00810,
2017.
[44] C. E. Shannon, “Coding theorems for a discrete source with a
fidelity criterion,” in IRE National Convention Record, 1959, vol. 4,
pp. 142–163, 1959.
[45] T. Berger, “Rate distortion theory for sources with abstract alpha-
bets and memory,” Information and Control, vol. 13, no. 3, pp. 254–
273, 1968.
[46] N. Tishby and N. Zaslavsky, “Deep learning and the information
bottleneck principle,” in 2015 ieee information theory workshop (itw),
pp. 1–5, IEEE, 2015.
[47] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a
neural network,” 2015.
[48] L. Wang and K.-J. Yoon, “Knowledge distillation and student-
teacher learning for visual intelligence: A review and new out-
looks,” IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 2021.
[49] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and
Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint
arXiv:1412.6550, 2014.
[50] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba,
“Learning deep features for discriminative localization,” in Pro-
ceedings of the IEEE conference on computer vision and pattern recogni-
tion, pp. 2921–2929, 2016.
[51] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and
D. Batra, “Grad-CAM: Visual explanations from deep networks
via gradient-based localization,” International Journal of Computer
Vision, vol. 128, pp. 336–359, oct 2019.
[52] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striv-
ing for simplicity: The all convolutional net,” in ICLR (workshop
track), 2015.
[53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” CoRR, vol. abs/1512.03385, 2015.
[54] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv
preprint arXiv:1605.07146, 2016.
[55] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo,
“Swin transformer: Hierarchical vision transformer using shifted
windows,” in Proceedings of the IEEE/CVF International Conference
on Computer Vision, pp. 10012–10022, 2021.
[56] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte,
“Swinir: Image restoration using swin transformer,” in Proceed-
ings of the IEEE/CVF International Conference on Computer Vision,
pp. 1833–1844, 2021.
[57] R. Wightman, “Pytorch image models.”
https://github.com/rwightman/pytorch-image-models, 2019.
[58] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980, 2014.
[59] A. D. I. Pytorch, “Pytorch,” 2018.
[60] J. Bégaint, F. Racapé, S. Feltman, and A. Pushparaja, “Compressai:
a pytorch library and evaluation platform for end-to-end compres-
sion research,” arXiv preprint arXiv:2011.03029, 2020.
[61] J. Gildenblat and contributors, “Pytorch library for cam methods.”
https://github.com/jacobgil/pytorch-grad-cam, 2021.
[62] Y. Matsubara, “torchdistill: A Modular, Configuration-Driven
Framework for Knowledge Distillation,” in International Workshop
on Reproducible Research in Pattern Recognition, pp. 24–44, Springer,
2021.
[63] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie,
“A convnet for the 2020s,” in Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 11976–11986,
2022.
[64] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image
compression with discretized gaussian mixture likelihoods and
attention modules,” 2020.
[65] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “End-to-end
learnt image compression via non-local attention optimization and
improved context modeling,” IEEE Transactions on Image Process-
ing, vol. 30, pp. 3179–3191, 2021.
[66] M. Lu, P. Guo, H. Shi, C. Cao, and Z. Ma, “Transformer-based
image compression,” in 2022 Data Compression Conference (DCC),
pp. 469–469, 2022.
[67] X. Luo, H. Talebi, F. Yang, M. Elad, and P. Milanfar, “The
rate-distortion-accuracy tradeoff: Jpeg case study,” arXiv preprint
arXiv:2008.00605, 2020.
[68] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining
discriminative components with random forests,” in Computer
Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland,
September 6-12, 2014, Proceedings, Part VI 13, pp. 446–461, Springer,
2014.
[69] M.-E. Nilsback and A. Zisserman, “Automated flower classifica-
tion over a large number of classes,” in 2008 Sixth Indian Conference
on Computer Vision, Graphics & Image Processing, pp. 722–729, 2008.
Alireza Furutanpey received an MSc from the
Technical University of Vienna, Austria in 2022
with distinction in the field of Computer Sci-
ence. He is now a PhD candidate at the Dis-
tributed Systems Group in the field of Edge
Computing. His research interests include Mo-
bile Edge Computing, Edge Intelligence and Ma-
chine Learning.
Philipp Raith received an MSc from the Technical
University of Vienna, Austria in 2021 with distinc-
tion in the field of Computer Science. He is now a
PhD candidate at the Distributed Systems Group
in the field of Edge Computing. His research
interests include Serverless Edge Computing,
Edge Intelligence and Operations for AI.
Schahram Dustdar is a full professor of com-
puter science and heads TU Wien’s Distributed
Systems Group. His research interests include
distributed systems, Edge Intelligence, complex
and autonomic software systems. He’s the ed-
itor in chief of Computing; associate editor of
ACM Transactions on the Web, ACM Transac-
tions on Internet Technology, IEEE Transactions
on Cloud Computing, and IEEE Transactions on
Services Computing. He’s also on the editorial
boards of IEEE Internet Computing and IEEE
Computer. He has received the ACM Distinguished Scientist award and
Distinguished Speaker Award and the IBM Faculty Award. He is an
elected member of Academia Europaea, where he was Informatics
Section chairman from 2015 to 2022. He is an IEEE Fellow and AAIA
Fellow where he is the current President.