LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Naoto Inoue¹  Kotaro Kikuchi¹  Edgar Simo-Serra²  Mayu Otani¹  Kota Yamaguchi¹
1CyberAgent, Japan 2Waseda University, Japan
{inoue naoto, kikuchi kotaro xa}@cyberagent.co.jp ess@waseda.jp
{otani mayu, yamaguchi kota}@cyberagent.co.jp
Abstract
Controllable layout generation aims at synthesizing plausible arrangements of element bounding boxes with optional constraints, such as the type or position of a specific element. In this work, we try to solve a broad range of layout generation tasks in a single model based on discrete state-space diffusion models. Our model, named LayoutDM, naturally handles structured layout data in a discrete representation and learns to progressively infer a noiseless layout from the initial input, where we model the layout corruption process by modality-wise discrete diffusion. For conditional generation, we propose to inject layout constraints in the form of masking or logit adjustment during inference. We show in experiments that LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks.¹
1. Introduction
Graphic layouts play a critical role in visual communication. Automatically creating a visually pleasing layout has tremendous application benefits, ranging from the authoring of printed media [45] to the design of application user interfaces [5], and there has been growing research interest in the community. The task of layout generation considers the arrangement of elements, where each element has a tuple of attributes, such as category, position, or size; depending on the task setup, there can be optional control inputs that specify some of the elements or attributes. Due to the structured nature of layout data, it is crucial to consider relationships between elements during generation. For this reason, current generation approaches either build an autoregressive model [2,11] or develop a dedicated inference strategy to explicitly consider relationships [19–21].
In this paper, we propose to utilize discrete state-space
¹ Please find the code and models at: https://cyberagentailab.github.io/layout-dm.
[Figure 1 graphic: LayoutDM corrupts a complete layout into [MASK] tokens and denoises from an initial to a final layout; optional conditions include Category, Category+size, Completion, Refinement, and Category+relationship.]
Figure 1. Overview of LayoutDM. Top: LayoutDM is trained to
gradually generate a complete layout from a blank state in discrete
state space. Bottom: During sampling, we can steer LayoutDM
to perform various conditional generation tasks without additional
training or external models.
diffusion models [3,9,14] for layout generation tasks. Diffusion models have shown promising performance on various generation tasks, including images and texts [13]. We formulate the diffusion process for layout structure by modality-wise discrete diffusion and train a denoising backbone network to progressively infer the complete layout with or without conditional inputs. To support variable-length layout data, we extend the discrete state space with a special [PAD] token instead of the end-of-sequence token typical in autoregressive models. Our model can incorporate complex layout constraints via logit adjustment, so that we can refine an existing layout or impose relative size constraints between elements without additional training.
We discuss two key advantages of LayoutDM over existing models for conditional layout generation. Our model avoids the immutable dependency chain issue [20] that happens in autoregressive models [11]: autoregressive models fail to perform conditional generation when the condition disagrees with the pre-defined generation order of elements and attributes. Unlike non-autoregressive models [20], our model can generate a variable-length set of elements. We empirically show in Sec. 4.5 that naively extending non-autoregressive models with padding results in suboptimal variable-length generation, while combining padding with our diffusion formulation leads to significant improvement.
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.
We evaluate LayoutDM on various layout generation tasks tackled by previous works [20,21,33,36] using two large-scale datasets, Rico [5] and PubLayNet [45]. LayoutDM outperforms task-agnostic baselines in the majority of cases and shows promising performance compared with task-specific baselines. We further conduct an ablation study to demonstrate the significant impact of our design choices in LayoutDM, including the quantization of continuous variables and positional embedding.
We summarize our contributions as follows:
• We formulate the discrete diffusion process for layout generation and propose modality-wise diffusion and a padding approach to model highly structured layout data.
• We propose to inject complex layout constraints via masking and logit adjustment during inference, so that our model can solve diverse tasks in a single model.
• We empirically show solid performance on various conditional layout generation tasks on public datasets.
2. Related Work
2.1. Layout Generation
Studies on automatic layout generation have appeared several times in the literature [1,26,32,42]. Layout tasks are commonly observed in design applications, including magazine covers, posters, presentation slides, application user interfaces, and banner advertising [5,8,10,18,35,41,42,44]. Recent approaches to layout generation consider both unconditional generation [2,11,15,16] and conditional generation in various setups, such as conditional inputs of category or size [19–21,23], relational constraints [19,21], element completion [11], and refinement [36]. Some attempt to solve multiple tasks in a single model [20,33].
BLT [20] points out that recent autoregressive decoders [2,11] are not fully capable of considering partial inputs, i.e., known elements or attributes, during generation because they have a fixed generation order. BLT addresses conditional generation with a fill-in-the-blank task formulation using a bidirectional Transformer encoder similar to masked language models [6]. However, BLT cannot solve the layout completion demonstrated in decoder-based models because it requires the number of elements to be known. Our LayoutDM enjoys the best of both worlds and supports a broader range of conditional generation tasks in a single model.
Another layout-specific consideration is complex user-specified constraints, such as positional requirements between two boxes (e.g., a header box should be on top of a paragraph box). Earlier approaches [31,32,43] propose hand-crafted cost functions representing the degree of violation of aesthetic constraints, so that those constraints guide the optimization process of layout inference. CLG-LO [19] proposes an aesthetically constrained optimization framework for pre-trained GANs. Our LayoutDM solves such constrained generation tasks on top of task-agnostic iterative prediction via logit adjustment.
2.2. Discrete Diffusion Models
Diffusion models [38] are generative models characterized by a forward and a reverse Markov process. The forward process corrupts the data into a sequence of increasingly noisy variables; the reverse process gradually denoises the variables toward the actual data distribution. Diffusion models are stable to train and achieve faster sampling than autoregressive models through parallel iterative refinement. Recently, many approaches have learned the reverse process with a neural network and shown strong empirical performance [7,13,39] in continuous state spaces, such as images.
Discrete state spaces are a natural representation of discrete variables, such as text. D3PM [3] extends the pioneering work of Hoogeboom et al. [14] to structured categorical corruption processes for diffusion models in discrete state spaces, while maintaining the advantages of diffusion models for continuous state spaces. VQDiffusion [9] develops a corruption approach called mask-and-replace to avoid the accumulated prediction errors common in models based on iterative prediction. Following the corruption model of VQDiffusion, we carefully design a modality-wise corruption process for layout tasks, which involve tokens from disjoint vocabularies per modality.
Several studies consider conditional inputs in the inference process of diffusion models. Some approaches alter the reverse diffusion iteration to carefully inject given conditions for free-form image inpainting [28] or image editing by strokes or composition [30]. We extend discrete state-space diffusion models via hard masking and logit adjustment to support conditional generation of layouts.
3. LayoutDM
Our LayoutDM builds on discrete state-space diffusion models [3,9]. We first briefly review the fundamentals of discrete diffusion models in Sec. 3.1. Sec. 3.2 explains our approach to layout generation within the diffusion framework while discussing features of layouts compared with text. Sec. 3.3 discusses how we extend the denoising steps to perform various conditional layout generation tasks by imposing conditions in each step of the reverse process.
3.1. Preliminary: Discrete Diffusion Models
Diffusion models [38] are generative models characterized by a forward and a reverse Markov process. While many diffusion models are defined on a continuous space with Gaussian corruption, D3PM [3] introduces a general diffusion framework for categorical variables designed primarily for texts. Let $T \in \mathbb{N}$ be the total number of timesteps of the diffusion model; we first explain the forward diffusion process. For a scalar discrete variable with $K$ categories, $z_t \in \{1, 2, \ldots, K\}$ at timestep $t \in \mathbb{N}$, the probability that $z_{t-1}$ transits to $z_t$ is defined using a transition matrix $Q_t \in [0,1]^{K \times K}$ with $[Q_t]_{mn} = q(z_t = m \mid z_{t-1} = n)$:

$$q(z_t \mid z_{t-1}) = v(z_t)^\top Q_t\, v(z_{t-1}), \tag{1}$$

where $v(z_t) \in \{0,1\}^K$ is a one-hot column vector of $z_t$. The categorical distribution over $z_t$ given $z_{t-1}$ is given by the column vector $Q_t v(z_{t-1}) \in [0,1]^K$. Assuming the Markov property, we can derive $q(z_t \mid z_0) = v(z_t)^\top \overline{Q}_t\, v(z_0)$, where $\overline{Q}_t = Q_t Q_{t-1} \cdots Q_1$, and:

$$q(z_{t-1} \mid z_t, z_0) = \frac{q(z_t \mid z_{t-1}, z_0)\, q(z_{t-1} \mid z_0)}{q(z_t \mid z_0)} = \frac{\bigl(v(z_t)^\top Q_t\, v(z_{t-1})\bigr)\bigl(v(z_{t-1})^\top \overline{Q}_{t-1}\, v(z_0)\bigr)}{v(z_t)^\top \overline{Q}_t\, v(z_0)}. \tag{2}$$

Note that due to the Markov property, $q(z_t \mid z_{t-1}, z_0) = q(z_t \mid z_{t-1})$. When we consider $N$-dimensional variables $\mathbf{z}_t \in \{1, 2, \ldots, K\}^N$, the corruption is applied to each variable independently. In the following, we explain with the $N$-dimensional variables $\mathbf{z}_t$.
In contrast to the forward process, the reverse denoising process considers a conditional distribution of $\mathbf{z}_{t-1}$ given $\mathbf{z}_t$, modeled by a neural network $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \in [0,1]^{N \times K}$; $\mathbf{z}_{t-1}$ is sampled according to this distribution. The typical implementation predicts the unnormalized log-probabilities $\log p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ with a stack of bidirectional Transformer encoder blocks. D3PM uses a neural network $\tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t)$, combines it with the posterior $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{z}_0)$, and sums over possible $\tilde{\mathbf{z}}_0$ to obtain the following parameterization:

$$p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \propto \sum_{\tilde{\mathbf{z}}_0} q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \tilde{\mathbf{z}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t). \tag{3}$$
In addition to the commonly used variational lower bound objective $\mathcal{L}_{\mathrm{vb}}$, D3PM introduces an auxiliary denoising objective. The overall objective is:

$$\mathcal{L}_\lambda = \mathcal{L}_{\mathrm{vb}} + \lambda\, \mathbb{E}_{\mathbf{z}_0 \sim q(\mathbf{z}_0),\, \mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{z}_0)} \left[ -\log \tilde{p}_\theta(\mathbf{z}_0 \mid \mathbf{z}_t) \right], \tag{4}$$

where $\lambda$ is a hyper-parameter balancing the two loss terms.
Although D3PM proposes many variants of $Q_t$, VQDiffusion [9] offers an improved version of $Q_t$ called the mask-and-replace strategy. They introduce an additional special
Figure 2. Overview of the corruption and denoising processes in
LayoutDM. For simplicity, we use a toy layout consisting of two
elements and the model generates three elements at maximum.
token [MASK] and three probabilities: $\gamma_t$ of replacing the current token with the [MASK] token, $\beta_t$ of replacing the token with another token, and $\alpha_t$ of keeping the token unchanged. The [MASK] token never transitions to other states. The transition matrix $Q_t \in [0,1]^{(K+1) \times (K+1)}$ is defined by:

$$Q_t = \begin{bmatrix} \alpha_t + \beta_t & \beta_t & \cdots & \beta_t & 0 \\ \beta_t & \alpha_t + \beta_t & \cdots & \beta_t & 0 \\ \vdots & \vdots & \ddots & \beta_t & 0 \\ \beta_t & \beta_t & \cdots & \alpha_t + \beta_t & 0 \\ \gamma_t & \gamma_t & \cdots & \gamma_t & 1 \end{bmatrix}. \tag{5}$$

$(\alpha_t, \beta_t, \gamma_t)$ are carefully designed so that $z_t$ converges to the [MASK] token for sufficiently large $t$. During testing, we start from $\mathbf{z}_T$ filled with [MASK] tokens and iteratively sample a new set of tokens $\mathbf{z}_{t-1}$ from $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$.
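To make Eq. (5) concrete, the mask-and-replace matrix can be built in a few lines of NumPy. The helper name and the numeric values of the probabilities below are illustrative, not the schedule used in the paper; columns must sum to one, which requires $\alpha_t + K\beta_t + \gamma_t = 1$.

```python
import numpy as np

def mask_and_replace_Q(K, alpha, beta, gamma):
    """Build the (K+1) x (K+1) mask-and-replace transition matrix of Eq. (5).

    States 0..K-1 are ordinary tokens; state K is [MASK]. Column n holds the
    distribution of z_t given z_{t-1} = n, so each column must sum to 1,
    which requires alpha + K * beta + gamma == 1.
    """
    Q = np.full((K + 1, K + 1), beta)
    np.fill_diagonal(Q, alpha + beta)
    Q[K, :] = gamma   # any ordinary token may be replaced by [MASK]
    Q[:, K] = 0.0     # [MASK] never transitions to other states...
    Q[K, K] = 1.0     # ...it is absorbing
    return Q

Q = mask_and_replace_Q(K=4, alpha=0.90, beta=0.02, gamma=0.02)
assert np.allclose(Q.sum(axis=0), 1.0)  # every column is a distribution
```

The absorbing [MASK] column is what makes $\mathbf{z}_T$ converge to all-[MASK] for a suitable schedule.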
3.2. Unconditional Layout Generation
A layout $l$ is a set of elements represented by $l = \{(c_1, \mathbf{b}_1), \ldots, (c_E, \mathbf{b}_E)\}$, where $E \in \mathbb{N}$ is the number of elements in the layout, $c_i \in \{1, \ldots, C\}$ is the categorical information of the $i$-th element, and $\mathbf{b}_i \in [0,1]^4$ is the bounding box of the $i$-th element in normalized coordinates, whose first two values indicate the center location and last two indicate the width and height. Following previous works [2,11,20] that regard layout generation as generating a sequence of tokens, we quantize each value in $\mathbf{b}_i$ and obtain $[x_i, y_i, w_i, h_i]^\top \in \{1, \ldots, B\}^4$, where $B$ is the number of bins. The layout $l$ is now represented by $l = \{(c_1, x_1, y_1, w_1, h_1), \ldots\}$.
In this work, we corrupt a layout in a modality-wise manner in the forward process, and we denoise the corrupted layout while considering all elements and modalities in the reverse process, as illustrated in Fig. 2. Similarly to D3PM [3], we parameterize $p_\theta$ by a Transformer encoder [40], which processes an ordered 1D sequence. To process $l$ by $p_\theta$ while avoiding the order dependency issue [20], we randomly shuffle $l$ in an element-wise manner and then flatten it to produce $l_{\mathrm{flat}} = (c_1, x_1, y_1, w_1, h_1, c_2, x_2, \ldots)$.
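The shuffle-and-flatten step can be sketched as follows. The `quantize` helper uses uniform binning to keep the sketch simple (the paper prefers KMeans centroids, see the paragraph on adaptive quantization), and the toy layout is illustrative.

```python
import random

B = 32  # number of bins per geometric attribute

def quantize(v, bins=B):
    """Uniformly quantize a normalized coordinate in [0, 1] to a bin in {1, ..., bins}."""
    return min(int(v * bins) + 1, bins)

def flatten_layout(elements, shuffle=True):
    """elements: list of (category, (cx, cy, w, h)) tuples.
    Returns the element-wise shuffled, flattened token sequence l_flat."""
    elements = list(elements)
    if shuffle:
        random.shuffle(elements)  # avoid imposing a fixed generation order
    seq = []
    for c, (cx, cy, w, h) in elements:
        seq += [c] + [quantize(v) for v in (cx, cy, w, h)]
    return seq

layout = [(3, (0.5, 0.1, 0.9, 0.1)), (1, (0.5, 0.6, 0.8, 0.4))]
seq = flatten_layout(layout, shuffle=False)
# 5 tokens per element: (c_i, x_i, y_i, w_i, h_i)
assert len(seq) == 5 * len(layout)
```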
Variable-length generation Existing diffusion models generate fixed-dimensional data and are not directly applicable to layout generation because the number of elements varies across layouts. To handle this, we introduce a [PAD] token and define a maximum number of elements per layout $M \in \mathbb{N}$. Each layout becomes fixed-dimensional data composed of $5M$ tokens by appending $5(M - E)$ [PAD] tokens. [PAD] is treated like an ordinary token in VQDiffusion, and the dimension of $Q_t$ becomes $(K+2) \times (K+2)$.
Modality-wise diffusion Discrete state-space models assume that all standard tokens are switchable by corruption. However, layout tokens comprise disjoint token groups, one per attribute of an element. For example, applying the transition rule of Eq. (5) may change a token representing an element's category into a token representing a width. To avoid such invalid switching, we propose to apply disjoint corruption matrices $Q_t^c, Q_t^x, Q_t^y, Q_t^w, Q_t^h$ to the tokens representing the different attributes $c, x, y, w, h$, as shown in Fig. 2. The size of each matrix is $(C+2) \times (C+2)$ for $Q_t^c$ and $(B+2) \times (B+2)$ otherwise, where the $+2$ accounts for [PAD] and [MASK].
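A minimal sketch of modality-wise corruption, with illustrative vocabulary sizes and corruption rates: each attribute keeps its own transition matrix, so a corrupted token can only stay within its own modality's vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition_matrix(num_tokens, alpha=0.9, gamma=0.05):
    """Mask-and-replace matrix over `num_tokens` ordinary tokens plus
    [PAD] (treated as an ordinary token) and an absorbing [MASK]."""
    n = num_tokens + 2                      # +2 for [PAD] and [MASK]
    beta = (1.0 - alpha - gamma) / (n - 1)  # spread the remainder uniformly
    Q = np.full((n, n), beta)
    np.fill_diagonal(Q, alpha + beta)
    Q[-1, :] = gamma    # replace with [MASK]
    Q[:, -1] = 0.0      # [MASK] is absorbing
    Q[-1, -1] = 1.0
    return Q

C, B = 5, 8  # toy category / bin vocabulary sizes
Qs = {"c": transition_matrix(C)}
for a in "xywh":
    Qs[a] = transition_matrix(B)

def corrupt_step(tokens, attrs):
    """One forward step: each token is corrupted with its own modality's
    matrix, so a category token can never turn into a width token."""
    out = []
    for tok, attr in zip(tokens, attrs):
        Q = Qs[attr]
        out.append(int(rng.choice(len(Q), p=Q[:, tok])))
    return out

attrs = ["c", "x", "y", "w", "h"] * 2  # two elements
tokens = [1, 3, 0, 2, 4, 0, 5, 6, 7, 1]
zt = corrupt_step(tokens, attrs)
# category positions stay within the category vocabulary (C + 2 states)
assert all(z < C + 2 for z, a in zip(zt, attrs) if a == "c")
```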
Adaptive Quantization The distribution of position and size information in layouts is highly imbalanced; e.g., elements tend to be aligned to the left, center, or right. Applying uniform quantization to these quantities, as in existing layout generation models [2,11,20], results in a loss of information. As a pre-processing step, we propose to apply a classical clustering algorithm, such as KMeans [29], on $x$, $y$, $w$, and $h$ independently to obtain balanced position and size tokens for each dataset. We show in Sec. 4.7 how the quantization strategy affects the resulting quality.
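A rough sketch of the adaptive quantization idea using a minimal 1-D Lloyd's algorithm (the paper uses KMeans [29]; this toy implementation and the synthetic coordinates are illustrative only):

```python
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Minimal 1-D Lloyd's algorithm. Returns sorted centroids, which serve
    both as the dequantization table loc(.) and to tokenize values."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False).astype(float)
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = values[assign == j].mean()
    return np.sort(centroids)

# Toy x-coordinates clustered around left / center / right alignment.
x = np.concatenate([np.full(50, 0.1), np.full(30, 0.5), np.full(20, 0.9)])
x += np.random.default_rng(1).normal(0, 0.01, size=x.size)
centroids = kmeans_1d(x, k=3)

def tokenize(v):
    """Map a continuous coordinate to the nearest centroid's bin index."""
    return int(np.abs(centroids - v).argmin())

assert tokenize(0.1) != tokenize(0.9)  # distinct alignments get distinct bins
```

Unlike uniform binning, the centroids concentrate where the data actually lies, so frequent alignment positions get dedicated tokens.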
Decoupled Positional Encoding Previous works apply standard positional encoding to the flattened sequence of layout tokens $l_{\mathrm{flat}}$ [2,11,20]. We argue that this flattening approach can lose the structural information of the layout and lead to inferior generation performance. In a layout, each token has two types of indices: the $i$-th element and the $j$-th attribute. We empirically find that independently applying positional encoding to these two indices improves final generation performance, which we study in Sec. 4.7.
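The decoupled encoding can be sketched as follows, with hypothetical (here random, in practice learned) embedding tables: token $k$ of $l_{\mathrm{flat}}$ receives the sum of an element-index embedding and an attribute-index embedding rather than a single flat position embedding.

```python
import numpy as np

D = 16        # embedding dimension (illustrative)
M, A = 25, 5  # max elements, attributes per element (c, x, y, w, h)

rng = np.random.default_rng(0)
elem_emb = rng.normal(size=(M, D))  # stand-in for a learned per-element table
attr_emb = rng.normal(size=(A, D))  # stand-in for a learned per-attribute table

def decoupled_pos_encoding(seq_len):
    """Token k of l_flat belongs to element i = k // A and attribute j = k % A,
    and receives elem_emb[i] + attr_emb[j]."""
    pos = np.arange(seq_len)
    return elem_emb[pos // A] + attr_emb[pos % A]

enc = decoupled_pos_encoding(2 * A)
# Tokens of the same attribute in different elements share attr_emb:
assert np.allclose(enc[1] - elem_emb[0], enc[6] - elem_emb[1])
```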
3.3. Conditional Generation
We elaborate on solving various conditional layout generation tasks using a pre-trained, frozen LayoutDM. We inject conditional information into both the initial state $\mathbf{z}_T$ and the sampled states $\{\mathbf{z}_t\}_{t=0}^{T-1}$ during inference, but do not modify the denoising network $p_\theta$. The actual implementation of the injection differs by the type of condition.
Strong Constraints The most typical condition is partially known layout fields. Let $\mathbf{z}_{\mathrm{known}} \in \mathbb{Z}^N$ contain the known fields and $\mathbf{m} \in \{0,1\}^N$ be a mask vector denoting the known and unknown fields as 1 and 0, respectively. In each timestep $t$, we sample $\hat{\mathbf{z}}_{t-1}$ from $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ in Eq. (3) and then inject the condition by $\mathbf{z}_{t-1} = \mathbf{m} \odot \mathbf{z}_{\mathrm{known}} + (\mathbf{1} - \mathbf{m}) \odot \hat{\mathbf{z}}_{t-1}$, where $\mathbf{1}$ denotes an $N$-dimensional all-ones vector and $\odot$ denotes the element-wise product.
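The masking-based injection is a one-liner per reverse step; the token values below are arbitrary illustrations.

```python
import numpy as np

def inject_condition(z_hat, z_known, mask):
    """z_{t-1} = m * z_known + (1 - m) * z_hat: overwrite the sampled tokens
    with the known fields at every reverse step (mask = 1 where known)."""
    mask = np.asarray(mask)
    return mask * np.asarray(z_known) + (1 - mask) * np.asarray(z_hat)

# Category tokens (positions 0 and 5) are given; the rest are free.
z_hat   = np.array([9, 3, 4, 1, 2, 9, 6, 7, 8, 5])  # sampled from p_theta
z_known = np.array([2, 0, 0, 0, 0, 4, 0, 0, 0, 0])
mask    = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
z = inject_condition(z_hat, z_known, mask)
assert z[0] == 2 and z[5] == 4  # known categories survive the step
```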
Weak Constraints We may impose a weaker constraint during generation, such as placing an element near the center. We offer a way to impose such constraints in a unified framework without additional training or external neural network models. We propose to adjust the logits to inject weak constraints in log-probability space by

$$\log \hat{p}_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \propto \log p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) + \lambda_\pi \boldsymbol{\pi}, \tag{6}$$

where $\boldsymbol{\pi} \in \mathbb{R}^{N \times K}$ is a prior term that weights the desired outputs, and $\lambda_\pi \in \mathbb{R}$ is a hyper-parameter. The prior term can be defined either by hand (Refinement in Sec. 4.5) or through differentiable loss functions (Relationship in Sec. 4.5). Let $\{\mathcal{L}_i\}_{i=1}^{L}$ be a set of differentiable loss functions given the prediction; the latter prior definition can be written as:

$$\boldsymbol{\pi} = -\nabla_{p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)} \sum_{i=1}^{L} \mathcal{L}_i\bigl(p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)\bigr). \tag{7}$$

Although the formulation of Eq. (7) resembles steering diffusion models by gradients from external models [7,25], our primary focus is incorporating classical hand-crafted energies for aesthetic principles of layout [32] that do not depend on an external model. In practice, we tune the hyper-parameters for imposing weak constraints, such as $\lambda_\pi$. Note that these hyper-parameters are only used at inference time and are easier to tune than training hyper-parameters.
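A minimal sketch of the logit adjustment of Eq. (6), with a hand-coded prior that favors one token at one position (shapes and values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adjust_logits(logits, prior, lam):
    """Eq. (6): log p-hat is proportional to log p + lambda_pi * pi, where
    `logits` are the unnormalized log-probabilities from the denoiser."""
    return logits + lam * prior

N, K = 4, 6
rng = np.random.default_rng(0)
logits = rng.normal(size=(N, K))

prior = np.zeros((N, K))
prior[0, 2] = 1.0  # weakly prefer token 2 at sequence position 0

p = softmax(adjust_logits(logits, prior, lam=5.0))
p0 = softmax(logits)
assert p[0, 2] > p0[0, 2]          # the preferred token gains probability mass
assert np.allclose(p[1:], p0[1:])  # positions without a prior are unchanged
```

Raising `lam` trades sample quality against constraint satisfaction, mirroring the FID-violation trade-off reported in Sec. 4.5.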
4. Experiment
4.1. Datasets
We use two large-scale datasets for comparison, Rico [5] and PubLayNet [45]. As mentioned in Sec. 3.2, an element in a layout is described by the five attributes. For preprocessing, we set the maximum number of elements per layout $M$ to 25. If a layout contains more elements, we discard the whole layout.
We provide an overview of each dataset. Rico is a dataset
of user interface designs for mobile applications containing
25 element categories such as text button, toolbar, and icon.
We divide the dataset into 35,851 / 2,109 / 4,218 samples for
train, validation, and test splits. PubLayNet is a dataset of
research papers containing five element categories, such as
table, image, and text. We divide the dataset into 315,757 /
16,619 / 11,142 samples for train, validation, and test splits.
4.2. Evaluation Metrics
We employ two primary metrics: FID and Maximum
IoU (Max.). These metrics take into account both fidelity
and diversity [12], which are two mutually complemen-
tary properties widely used in evaluating generative mod-
els. FID [12] captures the similarity of generated data to
real ones in feature space. We employ an improved feature
extraction model for layouts [19] instead of a conventional
method [21] to compute FID. Maximum IoU [19] measures
the conditional similarity between generated and real lay-
outs. The similarity is measured by computing optimal
matching that maximizes average IoU between generated
and real layouts that have an identical set of categories. For
reference, we show the FID and Maximum IoU computed
between the validation and test data as Real data.
4.3. Tasks and Baselines
We test LayoutDM on six tasks for evaluation.
Unconditional generates layouts without any conditional
input or constraint.
Category→size+position (C→S+P) is a generation task
conditioned on the category of each element [20].
Category+size→position (C+S→P) is conditioned on the
category and size of each element.
Completion is conditioned on a small number of elements
whose attributes are all known. Given a complete layout,
we randomly sample from 0% to 20% of elements.
Refinement is conditioned on a noisy layout in which only the geometric information is perturbed [36]. Following RUITE [36], we synthesize the input layout by adding random noise to the size and position of each element, sampled from a normal distribution with mean 0 and variance 0.01.
Relationship is conditioned on the category of each element and some relational constraints between elements [21]. Following CLG-LO [19], we employ size and location relationships and randomly sample 10% of the relationships between elements for the experiment.
The first four tasks handle basic layout fields. We include a few task-agnostic models for comparison, using existing controllable layout generation methods or simple adaptations of generative models, as follows:
LayoutTrans is a simple autoregressive model [11] trained on an element-wise shuffled layout, following [33]. We set the generation order of the variables to c→w→h→x→y.
MaskGIT* is originally a non-autoregressive model for unconditional fixed-length data generation [4]. We use [PAD] tokens to enable variable-length generation.
BLT is a non-autoregressive model with a layout-specific decoding strategy [20].
BART is a denoising autoencoder that can solve both comprehension and generation tasks with a Transformer encoder-decoder backbone [22]. We randomly draw the number of [MASK] tokens from a uniform distribution between one and the sequence length, and perform masking based on that number.
VQDiffusion* is a diffusion-based model originally for text-to-image generation [9]. We adapt the model for layouts using $K = C + 4B + 2$ tokens, including [PAD].
4.4. Implementation Details
We re-implement most of the models, since few official implementations are publicly available except [11,19,20].² We train all models on the two datasets with three independent trials and report the average of the results.
LayoutDM follows VQDiffusion for hyper-parameters unless otherwise specified, such as the configuration of $p_\theta$ and the transition matrix parameters, i.e., $\alpha_t$ and $\gamma_t$. We set the loss weight $\lambda = 0.1$ (in Eq. (4)) and the number of diffusion timesteps $T = 100$. For optimization, we use AdamW [27] with a learning rate of $5.0 \times 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.98$.
Many models, including LayoutDM, use a Transformer [40] encoder backbone. We define a shared configuration as follows: 4 layers, 8 attention heads, 512 embedding dimensions, 2048 hidden dimensions, and a 0.1 dropout rate. For other models with extra modules, we adjust the number of hidden dimensions to roughly match the number of parameters for a fair comparison. We randomly shuffle the elements in each layout during training to avoid fixed-order generation. We search for the hyper-parameters that obtain the best FID on the validation set.
4.5. Quantitative Evaluation
C→S+P, C+S→P, Completion In these tasks, we inject conditions by masking. We summarize the comparisons in Tab. 1. As task-specific models, we include LayoutVAE [16], NDN-none [21], and LayoutGAN++ [19] for C→S+P, and we also adapt these models for C+S→P. LayoutDM outperforms the other models except LayoutTrans [11] in completion. The significant performance gap between LayoutDM and VQDiffusion* suggests the contribution of our proposals beyond simple discrete diffusion models, as discussed in Sec. 3.2. Results in completion suggest that the combination of padding and diffusion models is the primary key to the generation quality. We find that FID and Maximum IoU are not highly correlated only in the completion task. We conjecture that Maximum IoU may become unstable when categories are also predicted, unlike the C→S+P and C+S→P tasks where categories are given.
Fig. 3 shows qualitative results of several models, including LayoutDM. We can see that LayoutDM generates
² Unfortunately, most datasets have no official train-val-test splits, and previous approaches work on different splits and pre-processing strategies. Furthermore, the models used for FID computation also vary. Thus, we cannot directly compare our results with the figures reported in the literature.
Table 1. Quantitative comparison in conditional generation given partially known fields. Top two results are highlighted in bold and underline, respectively. † indicates the results of BLT trained with [PAD] as an additional vocabulary item, since the original model cannot perform unordered completion in practice.

                     | Category→Size+Position    | Category+Size→Position    | Completion
                     | Rico        | PubLayNet   | Rico        | PubLayNet   | Rico         | PubLayNet
Model                | FID↓  Max.↑ | FID↓  Max.↑ | FID↓  Max.↑ | FID↓  Max.↑ | FID↓  Max.↑  | FID↓  Max.↑
Task-specific models
LayoutVAE [16]       | 33.3  0.249 | 26.0  0.316 | 30.6  0.283 | 27.5  0.315 | -     -      | -     -
NDN-none [21]        | 28.4  0.158 | 61.1  0.162 | 62.8  0.219 | 69.4  0.222 | -     -      | -     -
LayoutGAN++ [19]     | 6.84  0.267 | 24.0  0.263 | 6.22  0.348 | 9.94  0.342 | -     -      | -     -
Task-agnostic models
LayoutTrans [11]     | 5.57  0.223 | 14.1  0.272 | 3.73  0.323 | 16.9  0.320 | 3.71  0.537  | 8.36  0.451
MaskGIT* [4]         | 26.1  0.262 | 17.2  0.319 | 8.05  0.320 | 5.86  0.380 | 33.5  0.533  | 19.7  0.484
BLT [20]             | 17.4  0.202 | 72.1  0.215 | 4.48  0.340 | 5.10  0.387 | 117†  0.471† | 131†  0.345†
BART [22]            | 3.97  0.253 | 9.36  0.320 | 3.18  0.334 | 5.88  0.375 | 8.87  0.527  | 9.58  0.446
VQDiffusion* [9]     | 4.34  0.252 | 10.3  0.319 | 3.21  0.331 | 7.13  0.374 | 11.0  0.541  | 11.1  0.373
LayoutDM             | 3.55  0.277 | 7.95  0.310 | 2.22  0.392 | 4.25  0.381 | 9.00  0.576  | 7.65  0.377
Real data            | 1.85  0.691 | 6.25  0.438 | 1.85  0.691 | 6.25  0.438 | 1.85  0.691  | 6.25  0.438
Figure 3. Comparison in conditional generation given partially known fields (C→S+P, C+S→P, and Completion on Rico and PubLayNet; columns show the condition, LayoutTrans [11], BLT [20], BART [22], LayoutDM, and real layouts).
high-quality layouts with fewer layout aesthetics violations,
such as misalignment and overlap, given diverse conditions.
Unconditional Generation Tab. 2 summarizes the results of unconditional generation. Unconditional layout generation methods often assume a fixed element generation order, e.g., top-to-bottom, rather than a random order, since constraining the prediction improves generation quality. For reference, we additionally report the results of LayoutTrans [11] trained on the fixed element order (LayoutTrans-fixed). Although we design LayoutDM primarily for conditional generation, it achieves the best FID under the random element order setting. We conjecture that BLT's poor performance is due to a train-test mask distribution inconsistency caused by its hierarchical masking strategy for training: BLT masks a randomly sampled number of fields from a single semantic group, i.e., category, position, or size, whereas decoding starts from all tokens masked at inference. The alignment metric of Real data stays at 0.109 in Rico; the much smaller alignment values of LayoutTrans and MaskGIT can be a signal of trivial outputs in Rico.
Table 2. Quantitative comparison in unconditional generation. Top two results are highlighted in bold and underline, respectively.

                        | Rico           | PubLayNet
Model                   | FID↓   Align.↓ | FID↓   Align.↓
LayoutTrans-fixed [11]  | 6.47   0.133   | 17.1   0.084
LayoutTrans [11]        | 7.63   0.068   | 13.9   0.127
MaskGIT* [4]            | 52.1   0.015   | 27.1   0.101
BLT [20]                | 88.2   1.030   | 116    0.153
BART [22]               | 11.9   0.090   | 16.6   0.116
VQDiffusion* [9]        | 7.46   0.178   | 15.4   0.193
LayoutDM                | 6.65   0.162   | 13.9   0.195
Real data               | 1.85   0.109   | 6.25   0.0214
Table 3. Quantitative comparison in the refinement task. Top two results are highlighted in bold and underline, respectively.

                      | Rico               | PubLayNet
Model                 | FID↓  Max.↑  Sim↑  | FID↓  Max.↑  Sim↑
Task-specific models
RUITE [36]            | 3.23  0.421  0.221 | 6.39  0.415  0.174
Task-agnostic models
Noisy input           | 134   0.213  0.177 | 130   0.242  0.147
LayoutDM              | 2.77  0.370  0.205 | 6.75  0.352  0.149
  w/o logit adj.      | 3.55  0.277  0.168 | 7.95  0.310  0.127
Real data             | 1.85  0.691  0.260 | 6.25  0.438  0.216
Figure 4. Qualitative comparison in the refinement task (columns: input, RUITE [36], LayoutDM, and real layouts; rows: Rico and PubLayNet).
Refinement Our LayoutDM performs this task with a combination of strong constraints on the element categories, i.e., setting $\mathbf{z}_{\mathrm{known}} = \{(c_1, [\mathrm{MASK}], \ldots, [\mathrm{MASK}]), \ldots\}$, and weak constraints that the geometric outputs should appear near the noisy inputs. As an example of the weak constraints, we describe a constraint that imposes the $x$-coordinate estimate of the $i$-th element to be close to the noisy continuous observation $\hat{x}_i$. We denote the sliced vector of the prior term $\boldsymbol{\pi}$ in Eq. (6) that corresponds to the $x$-coordinate of the $i$-th element as $\pi^i_x \in \mathbb{R}^K$
Figure 5. Quality-violation trade-off in the relationship task (FID vs. violation rate on Rico and PubLayNet, for LayoutGAN++ w/ CLG-LO, LayoutDM w/ and w/o logit adjustment, and NDN-partial). Lower scores indicate better performance for both metrics.
Figure 6. Qualitative comparison in the relationship task (relationship condition, results w/o and w/ relational constraints, and real layouts on Rico and PubLayNet).
and define it by:

$$\pi^i_{x_j} = \begin{cases} 1 & \text{if } |\mathrm{loc}(j) - \hat{x}_i| < m \text{ and } j \in X, \\ 0 & \text{otherwise}, \end{cases} \tag{8}$$

where $m$ is a hyper-parameter indicating a margin, $X$ is the set of vocabulary indices denoting tokens for $x$, and $\mathrm{loc}(j)$ is a function that returns the centroid value of the $j$-th token in the vocabulary. We define similar constraints for the other geometric variables and elements.
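The prior of Eq. (8) can be sketched as follows, assuming hypothetical bin centroids `loc` for the $x$-token part of the vocabulary; the margin value is illustrative.

```python
import numpy as np

# Hypothetical centroids loc(j) for the x-coordinate vocabulary.
loc = np.array([0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95])

def refinement_prior_x(x_noisy, margin=0.15):
    """Eq. (8): pi^i_{x_j} = 1 if |loc(j) - x_noisy| < margin, else 0,
    restricted here to the x-token slice of the vocabulary."""
    return (np.abs(loc - x_noisy) < margin).astype(float)

pi = refinement_prior_x(0.5)
assert pi[3] == 1.0              # the bin centered at 0.5 is favored
assert pi[0] == 0.0 and pi[6] == 0.0  # far-away bins get no weight
```

Adding `lam * pi` to the logits (Eq. (6)) then biases sampling toward bins near the noisy observation without forbidding other bins outright.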
We summarize the performance in Tab. 3. We additionally report DocSim [34] (Sim) to measure the similarity between a predicted layout and its corresponding ground truth. Imposing the noisy geometric fields as a weak prior significantly improves over the masking-only model and brings the performance much closer to RUITE [36], a dedicated denoising model that is not applicable to the other layout tasks. We compare some results in Fig. 4. Both LayoutDM and RUITE successfully recover complete layouts from non-trivially noisy layouts.
Relationship We use Eq. (7) to incorporate the relational constraints during the sampling steps of LayoutDM. We follow [19] to employ loss functions penalizing size and
Figure 7. Speed-quality trade-off of different models for C+S→P (time per sample [ms] vs. FID on PubLayNet; LayoutVAE, NDN-none, LayoutGAN++, LayoutTrans, BART, MaskGIT*, BLT, VQDiffusion*, LayoutDM, and real data).
Table 4. Ablation study results on layout-specific modifications, in unconditional generation on the Rico [5] dataset.

Model                        | FID↓  | Align.↓
LayoutDM                     | 6.65  | 0.162
  w/o modality-wise diff.    | 7.32  | 0.156
  w/o decoupled pos. enc.    | 6.78  | 0.227
  w/ uniform quantization    | 7.58  | 0.256
  w/ percentile quantization | 9.79  | 0.232
Real data                    | 1.85  | 0.109
location relationships between elements that do not match the user specifications. Since we define the loss functions for continuous bounding boxes, we have to convert the predicted discrete bounding boxes to continuous ones in a differentiable manner. Given the estimated probabilities of the discrete $x$-coordinates $p(x)$, for example, we compute the continuous $x$-coordinate $\bar{x}$ by $\bar{x} = \sum_{n \in X} p(x = n)\, \mathrm{loc}(n)$. A similar conversion applies to the other attributes. Empirically, we find that applying the logit adjustment multiple times (three times in our experiments) at each diffusion step moderately improves performance.
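The differentiable conversion is just an expectation over the bin centroids; the centroid values below are hypothetical.

```python
import numpy as np

loc = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # hypothetical x-bin centroids

def expected_x(p_x):
    """Differentiable conversion of a categorical distribution over discrete
    x-tokens to a continuous coordinate: x_bar = sum_n p(x = n) * loc(n)."""
    return float(np.dot(p_x, loc))

p = np.array([0.0, 0.5, 0.5, 0.0, 0.0])
assert abs(expected_x(p) - 0.4) < 1e-9  # 0.5 * 0.3 + 0.5 * 0.5
```

Because the expectation is linear in $p(x)$, gradients of the relational losses flow back through it to the predicted token probabilities.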
We compare LayoutDM with two task-specific approaches: NDN-partial [21] and CLG-LO based on LayoutGAN++ [19]. We show the results in Fig. 5, where we additionally report the constraint violation error rates [19]. LayoutDM can control the strength of the logit adjustment as in Eq. (6) and thus produces an FID-violation trade-off curve. LayoutDM is comparable to NDN-partial in Rico and outperforms NDN-partial by a large margin in PubLayNet. Although LayoutDM is inferior to CLG-LO in both datasets, note that the average runtime of CLG-LO is 4.0 s, which is much slower than the 0.5 s of LayoutDM. We show some results of LayoutDM in Fig. 6.
4.6. Speed-Quality Trade-off
Runtime is also essential for controllable generation. We show a speed-quality trade-off curve for C+S→P in Fig. 7. Transformer encoder-only models, such as LayoutDM and BLT, can achieve fast generation at the cost of some quality. For LayoutDM, we employ the fast-sampling strategy used in discrete diffusion models [3], $p_\theta(z_{t-\Delta} \mid z_t) \propto \sum_{\tilde{z}_0} q(z_{t-\Delta}, z_t \mid \tilde{z}_0)\, \tilde{p}_\theta(\tilde{z}_0 \mid z_t)$, where $\Delta \in \mathbb{N}$ is a step size for generation in $T/\Delta$ steps. Despite being a task-agnostic model, LayoutDM achieves the best quality-speed trade-off except for the task-specific LayoutGAN++ [19], which runs in under 10ms.
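The fast-sampling rule can be illustrated on a toy uniform-noising discrete diffusion for a single token. The number of categories, the noise schedule, and the stand-in prediction $\tilde{p}_\theta(\tilde{z}_0 \mid z_t)$ below are illustrative assumptions, not the configuration used in the paper:

```python
import numpy as np

K, T = 5, 8                      # categories, diffusion steps (toy values)
betas = np.linspace(0.05, 0.4, T)

# Per-step uniform transition matrices Q_t[i, j] = q(z_t = j | z_{t-1} = i).
Q = [(1 - b) * np.eye(K) + b * np.ones((K, K)) / K for b in betas]

def q_between(s, t):
    """q(z_t | z_s) as a K x K matrix, i.e. the product Q_{s+1} ... Q_t."""
    M = np.eye(K)
    for u in range(s, t):
        M = M @ Q[u]
    return M

def skip_step(z_t, t, delta, p0_tilde):
    """p(z_{t-delta} | z_t) ∝ sum_{z0} q(z_{t-delta}, z_t | z0) p~(z0 | z_t),
    factorized via the Markov property into a marginal and a likelihood."""
    # q(z_{t-delta} = j | z_0): cumulative transitions up to step t - delta.
    marg = p0_tilde @ q_between(0, t - delta)    # shape (K,)
    # q(z_t | z_{t-delta} = j): product over the delta skipped steps.
    lik = q_between(t - delta, t)[:, z_t]        # shape (K,)
    post = marg * lik
    return post / post.sum()

# Toy model prediction of the clean token given the current noisy token.
p0_tilde = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
post = skip_step(z_t=2, t=8, delta=4, p0_tilde=p0_tilde)  # jump 4 steps at once
```

Increasing $\Delta$ trades per-sample quality for fewer network evaluations, which is what produces the curve in Fig. 7.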
4.7. Ablation Study
We investigate whether the techniques in Sec. 3.2 improve performance. First, we evaluate the choice of quantization method for the geometric fields of elements. Instead of KMeans, we compute centroids for the quantization by:
• Uniform: dataset-agnostic quantization, which is popular in previous works [2, 11, 20]. Following [11], we choose $\{0.0, \frac{1}{B}, \dots, \frac{B-1}{B}\}$ and $\{\frac{1}{B}, \dots, \frac{B-1}{B}, 1.0\}$ for the position and size, respectively.
• Percentile: we sort the data into equally sized groups and take the average value of each group as its centroid. This is dataset-specific quantization, similar to KMeans.
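The two alternative centroid constructions can be sketched as follows. The bin count B and the toy data are illustrative, and the KMeans baseline would use a standard clustering implementation [29] instead:

```python
import numpy as np

def uniform_centroids(B, for_position=True):
    """Dataset-agnostic bins: {0, 1/B, ..., (B-1)/B} for positions,
    {1/B, ..., (B-1)/B, 1} for sizes, following the convention above."""
    if for_position:
        return np.arange(B) / B
    return np.arange(1, B + 1) / B

def percentile_centroids(data, B):
    """Dataset-specific bins: sort the data into B equally sized
    groups and take the mean of each group as its centroid."""
    data = np.sort(np.asarray(data, dtype=np.float64))
    groups = np.array_split(data, B)
    return np.array([g.mean() for g in groups])

# Toy x-coordinates clustered near the left edge of the canvas.
rng = np.random.default_rng(0)
data = np.clip(rng.beta(2, 5, size=1000), 0.0, 1.0)

u = uniform_centroids(4)            # [0.0, 0.25, 0.5, 0.75]
p = percentile_centroids(data, 4)   # centroids follow the data density
```

Unlike the fixed uniform grid, the percentile (and KMeans) centroids concentrate where elements actually occur, which is why the choice affects alignment between quantized elements.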
We show the results in the bottom half of Tab. 4. We additionally report the Alignment metric (Align.) used in [19], since the choice of quantization affects the alignment between elements. Compared to Uniform and Percentile, KMeans quantization significantly improves both FID and Alignment. As the top half of Tab. 4 shows, our modality-wise diffusion and decoupled positional encoding both moderately improve performance.
5. Discussion
LayoutDM is based on diffusion models for discrete state-spaces. Using a continuous state-space, as in latent diffusion models [37], would be an interesting direction. Extending LayoutDM to handle additional layout properties, such as color [17] and image/text content [41], is also appealing.
We believe our proposed logit adjustment can incorporate more attributes. Attribute-conditional LayoutGAN [24] considers the area, aspect ratio, and reading order of elements for fine-grained control. Since these attributes can easily be converted into size and location relationship constraints, incorporating them into LayoutDM should not be difficult.
Potential negative impact Our model might be used to
automatically generate the basic structure of fake websites
or mobile applications, which could lead to scams or the
spreading of misinformation.
References
[1] Maneesh Agrawala, Wilmot Li, and Floraine Berthouzoz.
Design principles for visual communication. Communica-
tions of the ACM, 54(4), 2011. 2
[2] Diego Martin Arroyo, Janis Postels, and Federico Tombari.
Variational transformer networks for layout generation. In
CVPR, 2021. 1,2,3,4,8
[3] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tar-
low, and Rianne van den Berg. Structured denoising diffu-
sion models in discrete state-spaces. In NeurIPS, 2021. 1,2,
3,8
[4] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T
Freeman. MaskGIT: Masked generative image transformer.
In CVPR, 2022. 5,6,7
[5] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hib-
schman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ran-
jitha Kumar. Rico: A mobile app dataset for building data-
driven design applications. In UIST, 2017. 1,2,4,8
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional trans-
formers for language understanding. In NAACL, 2019. 2
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021. 2,4
[8] Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale
Song. DOC2PPT: Automatic presentation slides generation
from scientific documents. In AAAI, 2022. 2
[9] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo
Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec-
tor quantized diffusion model for text-to-image synthesis. In
CVPR, 2022. 1,2,3,5,6,7
[10] Shunan Guo, Zhuochen Jin, Fuling Sun, Jingwen Li, Zhaorui
Li, Yang Shi, and Nan Cao. Vinci: an intelligent graphic
design system for generating advertising posters. In CHI,
2021. 2
[11] Kamal Gupta, Alessandro Achille, Justin Lazarow, Larry
Davis, Vijay Mahadevan, and Abhinav Shrivastava. Layout-
Transformer: Layout generation and completion with self-
attention. In ICCV, 2021. 1,2,3,4,5,6,7,8
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017. 5
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
sion probabilistic models. In NeurIPS, 2020. 1,2
[14] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. In NeurIPS, 2021. 1,2
[15] Zhaoyun Jiang, Shizhao Sun, Jihua Zhu, Jian-Guang Lou,
and Dongmei Zhang. Coarse-to-fine generative modeling for
graphic layouts. In AAAI, 2022. 2
[16] Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Si-
gal, and Greg Mori. LayoutVAE: Stochastic scene layout
generation from a label set. In CVPR, 2019. 2,5,6
[17] Kotaro Kikuchi, Naoto Inoue, Mayu Otani, Edgar Simo-
Serra, and Kota Yamaguchi. Generative colorization of struc-
tured mobile web pages. In WACV, 2023. 8
[18] Kotaro Kikuchi, Mayu Otani, Kota Yamaguchi, and Edgar
Simo-Serra. Modeling visual containment for web page lay-
out optimization. Computer Graphics Forum, 40(7), 2021.
2
[19] Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota
Yamaguchi. Constrained graphic layout generation via latent
optimization. In ACM MM, 2021. 1,2,5,6,7,8
[20] Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan
Hao, Haifeng Gong, and Irfan Essa. BLT: Bidirectional lay-
out transformer for controllable layout generation. In ECCV,
2022. 1,2,3,4,5,6,7,8
[21] Hsin-Ying Lee, Weilong Yang, Lu Jiang, Madison Le, Ir-
fan Essa, Haifeng Gong, and Ming-Hsuan Yang. Neural de-
sign network: Graphic layout generation with constraints. In
ECCV, 2020. 1,2,5,6,8
[22] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvinine-
jad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and
Luke Zettlemoyer. BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and
comprehension. In ACL, 2020. 5,6,7
[23] Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang,
and Tingfa Xu. LayoutGAN: Generating graphic layouts
with wireframe discriminators. In ICLR, 2019. 2
[24] Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu,
Christina Wang, and Tingfa Xu. Attribute-conditioned lay-
out gan for automatic graphic design. IEEE TVCG, 27(10),
2020. 8
[25] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and
Joshua B Tenenbaum. Compositional visual generation with
composable diffusion models. In ECCV, 2022. 4
[26] Simon Lok and Steven Feiner. A survey of automated layout
techniques for information presentations. In SmartGraphics,
2001. 2
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
regularization. In ICLR, 2019. 5
[28] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher
Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting
using denoising diffusion probabilistic models. In CVPR,
2022. 2
[29] J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability, 1967. 4
[30] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-
jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided
image synthesis and editing with stochastic differential equa-
tions. In ICLR, 2022. 2
[31] Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala,
and Vladlen Koltun. Interactive furniture layout using inte-
rior design guidelines. ACM TOG, 30(4), 2011. 2
[32] Peter O’Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. IEEE TVCG, 20(8), 2014. 2,4
[33] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten
Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregres-
sive transformers for indoor scene synthesis. In NeurIPS,
2021. 2,5
[34] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar
Averbuch-Elor. READ: Recursive autoencoders for docu-
ment layout generation. In CVPRW, 2020. 7
[35] Chunyao Qian, Shizhao Sun, Weiwei Cui, Jian-Guang Lou,
Haidong Zhang, and Dongmei Zhang. Retrieve-then-adapt:
Example-based automatic generation for proportion-related
infographics. IEEE TVCG, 27(2), 2020. 2
[36] Soliha Rahman, Vinoth Pandian Sermuga Pandian, and
Matthias Jarke. RUITE: Refining ui layout aesthetics using
transformer encoder. In 26th International Conference on
Intelligent User Interfaces-Companion, 2021. 2,5,7
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 8
[38] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, 2015. 2,3
[39] Yang Song and Stefano Ermon. Improved techniques for
training score-based generative models. In NeurIPS, 2020.
2
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017. 3,
5
[41] Kota Yamaguchi. CanvasVAE: Learning to generate vector
graphics documents. In ICCV, 2021. 2,8
[42] Xuyong Yang, Tao Mei, Ying-Qing Xu, Yong Rui, and
Shipeng Li. Automatic generation of visual-textual presen-
tation layout. TOMM, 12(2), 2016. 2
[43] Lap Fai Yu, Sai Kit Yeung, Chi Keung Tang, Demetri Ter-
zopoulos, Tony F Chan, and Stanley J Osher. Make it home:
automatic optimization of furniture arrangement. ACM TOG,
30(4), 2011. 2
[44] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH
Lau. Content-aware generative modeling of graphic design
layouts. ACM TOG, 38(4), 2019. 2
[45] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Pub-
LayNet: largest dataset ever for document layout analysis. In
ICDAR, 2019. 1,2,4