LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Naoto Inoue1  Kotaro Kikuchi1  Edgar Simo-Serra2  Mayu Otani1  Kota Yamaguchi1
1CyberAgent, Japan  2Waseda University, Japan
{inoue_naoto, kikuchi_kotaro_xa}@cyberagent.co.jp  ess@waseda.jp
{otani_mayu, yamaguchi_kota}@cyberagent.co.jp
Abstract
Controllable layout generation aims at synthesizing plausible arrangements of element bounding boxes with optional constraints, such as the type or position of a specific element. In this work, we try to solve a broad range of layout generation tasks in a single model that is based on discrete state-space diffusion models. Our model, named LayoutDM, naturally handles the structured layout data in the discrete representation and learns to progressively infer a noiseless layout from the initial input, where we model the layout corruption process by modality-wise discrete diffusion. For conditional generation, we propose to inject layout constraints in the form of masking or logit adjustment during inference. We show in the experiments that our LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks.1
1. Introduction
Graphic layouts play a critical role in visual communication. Automatically creating a visually pleasing layout has tremendous application benefits that range from authoring of printed media [45] to designing application user interfaces [5], and there has been growing research interest in the community. The task of layout generation considers the arrangement of elements, where each element has a tuple of attributes, such as category, position, or size, and depending on the task setup, there could be optional control inputs that specify part of the elements or attributes. Due to the structured nature of layout data, it is crucial to consider relationships between elements during generation. For this reason, current generation approaches either build an autoregressive model [2,11] or develop a dedicated inference strategy to explicitly consider relationships [19-21].
1 Please find the code and models at: https://cyberagentailab.github.io/layout-dm.

Figure 1. Overview of LayoutDM. Top: LayoutDM is trained to gradually generate a complete layout from a blank state in discrete state space. Bottom: During sampling, we can steer LayoutDM to perform various conditional generation tasks without additional training or external models.

In this paper, we propose to utilize discrete state-space
diffusion models [3,9,14] for layout generation tasks. Dif-
fusion models have shown promising performance for var-
ious generation tasks, including images and texts [13].
We formulate the diffusion process for layout structure by
modality-wise discrete diffusion, and train a denoising back-
bone network to progressively infer the complete layout
with or without conditional inputs. To support variable-
length layout data, we extend the discrete state-space with
a special PAD token instead of the typical end-of-sequence
token used in autoregressive models. Our model can incor-
porate complex layout constraints via logit adjustment, so
that we can refine an existing layout or impose relative size
constraints between elements without additional training.
We discuss two key advantages of LayoutDM over ex-
isting models for conditional layout generation. Our model
avoids the immutable dependency chain issue [20] that hap-
pens in autoregressive models [11]. Autoregressive mod-
els fail to perform conditional generation when the con-
dition disagrees with the pre-defined generation order of
elements and attributes. Unlike non-autoregressive mod-
els [20], our model can generate variable-length elements.
We empirically show in Sec. 4.5 that naively extending
non-autoregressive models by padding results in suboptimal
variable length generation while padding combined with
our diffusion formulation leads to significant improvement.
We evaluate LayoutDM on various layout generation
tasks tackled by previous works [20,21,33,36] using two
large-scale datasets, Rico [5] and PubLayNet [45]. Lay-
outDM outperforms task-agnostic baselines in the major-
ity of cases and shows promising performance compared
with task-specific baselines. We further conduct an ablation
study to prove the significant impact of our design choices
in LayoutDM, including quantization of continuous vari-
ables and positional embedding.
We summarize our contributions as follows:
We formulate the discrete diffusion process for layout
generation and propose a modality-wise diffusion and a
padding approach to model highly structured layout data.
We propose to inject complex layout constraints via
masking and logit adjustment during the inference, so that
our model can solve diverse tasks in a single model.
We empirically show solid performance for various con-
ditional layout generation tasks on public datasets.
2. Related Work
2.1. Layout Generation
Studies on automatic layout generation have appeared
several times in literature [1,26,32,42]. Layout tasks are
commonly observed in design applications, including mag-
azine covers, posters, presentation slides, application user
interface, or banner advertising [5,8,10,18,35,41,42,44].
Recent approaches to layout generation consider both un-
conditional generation [2,11,15,16] and conditional gener-
ation in various setups, such as conditional inputs of cate-
gory or size [19-21,23], relational constraints [19,21], el-
ement completion [11], and refinement [36]. Some attempt
at solving multiple tasks in a single model [20,33].
BLT [20] points out that the recent autoregressive de-
coders [2,11] are not fully capable of considering partial
inputs, i.e. known elements or attributes, during generation
because they have a fixed generation order. BLT addresses
the conditional generation by fill-in-the-blank task formu-
lation using a bidirectional Transformer encoder similar to
masked language models [6]. However, BLT cannot solve
layout completion demonstrated in the decoder-based mod-
els because of the requirement of the known number of el-
ements. Our LayoutDM enjoys the best of both worlds and
supports a broader range of conditional generation tasks in
a single model.
Another layout-specific consideration is the complex
user-specified constraints, such as the positional require-
ments between two boxes (e.g., a header box should be on
top of a paragraph box). Earlier approaches [31,32,43]
propose hand-crafted cost functions representing the vio-
lation degree of aesthetic constraints so that those con-
straints guide the optimization process of layout inference.
CLG-LO [19] proposes an aesthetically constrained opti-
mization framework for pre-trained GANs. Our LayoutDM
solves such constrained generation tasks on top of the task-
agnostic iterative prediction via logit adjustment.
2.2. Discrete Diffusion Models
Diffusion models [38] are generative models character-
ized by a forward and reverse Markov process. The for-
ward process corrupts the data into a sequence of increas-
ingly noisy variables. The reverse process gradually de-
noises the variables toward the actual data distribution. Dif-
fusion models are stable to train and achieve faster sampling
than autoregressive models by parallel iterative refinement.
Recently, many approaches have learned the reverse pro-
cess by a neural network and show strong empirical perfor-
mance [7,13,39] in continuous state spaces, such as images.
Discrete state spaces are a natural representation of dis-
crete variables, such as text. D3PM [3] extends the pioneer-
ing work of Hoogeboom et al. [14] to structured categor-
ical corruption processes for diffusion models in discrete
state spaces, while maintaining the advantages of diffusion
models for continuous state spaces. VQDiffusion [9] devel-
ops a corruption approach called mask-and-replace, so as
to avoid accumulated prediction errors that are common in
models based on iterative prediction. Following the corrup-
tion model of VQDiffusion, we carefully design a modality-
wise corruption process for layout tasks that involve tokens
from disjoint sets of vocabulary per modality.
Several studies consider a conditional input to the infer-
ence process of diffusion models. Some approaches alter
the reverse diffusion iteration to carefully inject given con-
ditions for free-form image inpainting [28] or image edit-
ing by strokes or composition [30]. We extend the discrete
state-space diffusion models via hard masking or logit ad-
justment to support the conditional generation of layouts.
3. LayoutDM
Our LayoutDM builds on discrete-state space diffusion
models [3,9]. We first briefly review the fundamentals of
discrete diffusion models in Sec. 3.1. Sec. 3.2 explains our
approach to layout generation within the diffusion frame-
work while discussing features inherent in layout compared
with text. Sec. 3.3 discusses how we extend denoising steps
to perform various conditional layout generation by impos-
ing conditions in each step of the reverse process.
3.1. Preliminary: Discrete Diffusion Models
Diffusion models [38] are generative models character-
ized by a forward and reverse Markov process. While
many diffusion models are defined on continuous space
with Gaussian corruption, D3PM [3] introduces a general
diffusion framework for categorical variables designed pri-
marily for texts. Let $T \in \mathbb{N}$ be the total number of timesteps of the diffusion model. We first explain the forward diffusion process. For a scalar discrete variable with $K$ categories $z_t \in \{1, 2, \ldots, K\}$ at timestep $t \in \mathbb{N}$, the probabilities that $z_{t-1}$ transits to $z_t$ are defined by a transition matrix $Q_t \in [0,1]^{K \times K}$ with $[Q_t]_{mn} = q(z_t = m \mid z_{t-1} = n)$:
$$q(z_t \mid z_{t-1}) = v(z_t)^\top Q_t\, v(z_{t-1}), \qquad (1)$$
where $v(z_t) \in \{0,1\}^K$ is a column one-hot vector of $z_t$. The categorical distribution over $z_t$ given $z_{t-1}$ is computed by the column vector $Q_t v(z_{t-1}) \in [0,1]^K$. Assuming the Markov property, we can derive $q(z_t \mid z_0) = v(z_t)^\top \overline{Q}_t\, v(z_0)$, where $\overline{Q}_t = Q_t Q_{t-1} \cdots Q_1$, and:
$$q(z_{t-1} \mid z_t, z_0) = \frac{q(z_t \mid z_{t-1}, z_0)\, q(z_{t-1} \mid z_0)}{q(z_t \mid z_0)} = \frac{\bigl(v(z_t)^\top Q_t\, v(z_{t-1})\bigr)\bigl(v(z_{t-1})^\top \overline{Q}_{t-1}\, v(z_0)\bigr)}{v(z_t)^\top \overline{Q}_t\, v(z_0)}. \qquad (2)$$
Note that due to the Markov property, $q(z_t \mid z_{t-1}, z_0) = q(z_t \mid z_{t-1})$. When we consider $N$-dimensional variables $\mathbf{z}_t \in \{1, 2, \ldots, K\}^N$, the corruption is applied to each variable of $\mathbf{z}_t$ independently. In the following, we explain with $N$-dimensional variables $\mathbf{z}_t$.
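As a concrete illustration of Eqs. (1)-(2), a minimal NumPy sketch of a single forward corruption step for one token is shown below. The uniform transition matrix is only one of the variants discussed by D3PM, and all names here are ours, not the released implementation.

```python
import numpy as np

def uniform_transition_matrix(K: int, beta_t: float) -> np.ndarray:
    # Q_t: with probability (1 - beta_t) keep the token, otherwise
    # resample it uniformly over the K categories (one D3PM variant).
    return (1.0 - beta_t) * np.eye(K) + beta_t * np.full((K, K), 1.0 / K)

def corrupt_one_step(z_prev: int, Q_t: np.ndarray, rng: np.random.Generator) -> int:
    # q(z_t | z_{t-1}) is the column of Q_t selected by the one-hot v(z_{t-1}).
    probs = Q_t[:, z_prev]
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
K = 32                                        # e.g., number of position bins
Q = uniform_transition_matrix(K, beta_t=0.1)
z1 = corrupt_one_step(7, Q, rng)              # corrupt token z0 = 7 for one step
```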
In contrast to the forward process, the reverse denoising process considers a conditional distribution of $\mathbf{z}_{t-1}$ given $\mathbf{z}_t$ modeled by a neural network $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \in [0,1]^{N \times K}$, and $\mathbf{z}_{t-1}$ is sampled according to this distribution. Note that the typical implementation is to predict the unnormalized log probabilities $\log p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ by a stack of bidirectional Transformer encoder blocks. D3PM uses a neural network $\tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t)$, combines it with the posterior $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{z}_0)$, and sums over possible $\tilde{\mathbf{z}}_0$ to obtain the following parameterization:
$$p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \propto \sum_{\tilde{\mathbf{z}}_0} q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \tilde{\mathbf{z}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t). \qquad (3)$$
In addition to the commonly used variational lower bound objective $\mathcal{L}_{\mathrm{vb}}$, D3PM introduces an auxiliary denoising objective. The overall objective is as follows:
$$\mathcal{L}_\lambda = \mathcal{L}_{\mathrm{vb}} + \lambda\, \mathbb{E}_{\mathbf{z}_0 \sim q(\mathbf{z}_0),\, \mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{z}_0)} \bigl[ -\log \tilde{p}_\theta(\mathbf{z}_0 \mid \mathbf{z}_t) \bigr], \qquad (4)$$
where $\lambda$ is a hyper-parameter that balances the two loss terms.
Figure 2. Overview of the corruption and denoising processes in LayoutDM. For simplicity, we use a toy layout consisting of two elements and the model generates three elements at maximum.

Although D3PM proposes many variants of $Q_t$, VQDiffusion [9] offers an improved version of $Q_t$ called the mask-and-replace strategy. They introduce an additional special token [MASK] and three probabilities: $\gamma_t$ of replacing the current token with the [MASK] token, $\beta_t$ of replacing the token with another token, and $\alpha_t$ of keeping the token unchanged. The [MASK] token never transitions to other states. The transition matrix $Q_t \in [0,1]^{(K+1) \times (K+1)}$ is defined by:
$$Q_t = \begin{bmatrix} \alpha_t + \beta_t & \beta_t & \cdots & \beta_t & 0 \\ \beta_t & \alpha_t + \beta_t & \cdots & \beta_t & 0 \\ \vdots & \vdots & \ddots & \beta_t & 0 \\ \beta_t & \beta_t & \beta_t & \alpha_t + \beta_t & 0 \\ \gamma_t & \gamma_t & \gamma_t & \gamma_t & 1 \end{bmatrix}. \qquad (5)$$
$(\alpha_t, \beta_t, \gamma_t)$ is carefully designed so that $z_t$ converges to the [MASK] token for sufficiently large $t$. During testing, we start from $\mathbf{z}_T$ filled with [MASK] tokens and iteratively sample a new set of tokens $\mathbf{z}_{t-1}$ from $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$.
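For reference, the mask-and-replace matrix of Eq. (5) can be assembled directly. The sketch below is ours (the [MASK] state is placed at the last index), not the released code; the particular $(\alpha_t, \beta_t, \gamma_t)$ values are illustrative and only need to satisfy the column-sum constraint.

```python
import numpy as np

def mask_and_replace_matrix(K: int, alpha_t: float, beta_t: float, gamma_t: float) -> np.ndarray:
    # States 0..K-1 are ordinary tokens, state K is [MASK].
    # Column n gives q(z_t | z_{t-1} = n); columns must sum to 1,
    # so alpha_t + K * beta_t + gamma_t = 1 is assumed.
    Q = np.full((K + 1, K + 1), beta_t)
    Q[np.arange(K), np.arange(K)] = alpha_t + beta_t   # keep the same token
    Q[K, :K] = gamma_t                                  # jump to [MASK]
    Q[:K, K] = 0.0                                      # [MASK] never leaves
    Q[K, K] = 1.0
    return Q

Q = mask_and_replace_matrix(K=32, alpha_t=0.9, beta_t=0.002, gamma_t=1.0 - 0.9 - 32 * 0.002)
assert np.allclose(Q.sum(axis=0), 1.0)
```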
3.2. Unconditional Layout Generation
A layout $l$ is a set of elements represented by $l = \{(c_1, \mathbf{b}_1), \ldots, (c_E, \mathbf{b}_E)\}$, where $E \in \mathbb{N}$ is the number of elements in the layout. $c_i \in \{1, \ldots, C\}$ is the categorical information of the $i$-th element in the layout, and $\mathbf{b}_i \in [0,1]^4$ is the bounding box of the $i$-th element in normalized coordinates, where the first two values indicate the center location and the last two indicate the width and height. Following previous works [2,11,20] that regard layout generation as generating a sequence of tokens, we quantize each value in $\mathbf{b}_i$ and obtain $[x_i, y_i, w_i, h_i] \in \{1, \ldots, B\}^4$, where $B$ is the number of bins. The layout $l$ is now represented by $l = \{(c_1, x_1, y_1, w_1, h_1), \ldots\}$.

In this work, we corrupt a layout in a modality-wise manner in the forward process, and we denoise the corrupted layout while considering all elements and modalities in the reverse process, as we illustrate in Fig. 2. Similarly to D3PM [3], we parameterize $p_\theta$ by a Transformer encoder [40], which processes an ordered 1D sequence. To process $l$ by $p_\theta$ while avoiding the order dependency issue [20], we randomly shuffle $l$ in an element-wise manner and then flatten it to produce $l_{\mathrm{flat}} = (c_1, x_1, y_1, w_1, h_1, c_2, x_2, \ldots)$.
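To make the token representation concrete, here is a small sketch of how a layout could be shuffled element-wise and flattened into the 1D sequence $l_{\mathrm{flat}}$. The uniform binning and helper names are our illustration; LayoutDM itself uses KMeans centroids for quantization (see below).

```python
import random

def flatten_layout(elements, num_bins=128, seed=0):
    """elements: list of (category, (cx, cy, w, h)) with coordinates in [0, 1].
    Returns the flattened token sequence (c1, x1, y1, w1, h1, c2, ...)."""
    def quantize(v):
        # simple uniform binning for illustration only
        return min(int(v * num_bins), num_bins - 1)

    elements = list(elements)
    random.Random(seed).shuffle(elements)   # element-wise shuffle avoids a fixed order
    tokens = []
    for category, (cx, cy, w, h) in elements:
        tokens += [category, quantize(cx), quantize(cy), quantize(w), quantize(h)]
    return tokens

l_flat = flatten_layout([(3, (0.5, 0.1, 0.9, 0.1)), (1, (0.5, 0.6, 0.8, 0.7))])
```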
Variable length generation Existing diffusion models generate fixed-dimensional data and are not directly applicable to layout generation because the number of elements in each layout varies. To handle this, we introduce a [PAD] token and define a maximum number of elements in the layout $M \in \mathbb{N}$. Each layout becomes fixed-dimensional data composed of $5M$ tokens by appending $5(M - E)$ [PAD] tokens. [PAD] is treated similarly to the ordinary tokens in VQDiffusion, and the dimension of $Q_t$ becomes $(K+2) \times (K+2)$.
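A sketch of this padding step under our naming assumptions (the PAD placeholder id is illustrative; the real model reserves a dedicated vocabulary entry):

```python
PAD = -1  # placeholder id for the [PAD] token in this sketch

def pad_layout(tokens, max_elements=25, attrs_per_element=5):
    # Append 5 * (M - E) [PAD] tokens so every layout has exactly 5 * M tokens.
    target_len = max_elements * attrs_per_element
    assert len(tokens) <= target_len and len(tokens) % attrs_per_element == 0
    return tokens + [PAD] * (target_len - len(tokens))
```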
Modality-wise diffusion Discrete state-space models assume that all the standard tokens are switchable by corruption. However, layout tokens comprise disjoint sets of token groups, one for each attribute of an element. For example, applying the transition rule of Eq. (5) may change a token representing an element's category into a token representing the width. To avoid such invalid switching, we propose to apply disjoint corruption matrices $Q_t^c, Q_t^x, Q_t^y, Q_t^w, Q_t^h$ for tokens representing the different attributes $c, x, y, w, h$, as we show in Fig. 2. The size of each matrix is $(C+2) \times (C+2)$ for $Q_t^c$ and $(B+2) \times (B+2)$ otherwise, where $+2$ accounts for [PAD] and [MASK].
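One way to realize the modality-wise corruption is to keep a separate transition matrix per attribute and route each token through the matrix of its own modality. This sketch, with our own variable names, assumes the flattened ordering $(c, x, y, w, h)$ per element:

```python
import numpy as np

ATTRS = ["c", "x", "y", "w", "h"]

def corrupt_layout(tokens, Q_per_attr, rng):
    # tokens: flattened sequence (c1, x1, y1, w1, h1, c2, ...);
    # Q_per_attr: dict mapping attribute name -> its own transition matrix,
    # so a category token can never turn into a width token, etc.
    out = []
    for i, z in enumerate(tokens):
        Q = Q_per_attr[ATTRS[i % len(ATTRS)]]
        out.append(int(rng.choice(Q.shape[0], p=Q[:, z])))
    return out
```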
Adaptive Quantization The distribution of position and size information in layouts is highly imbalanced; e.g., elements tend to be aligned to the left, center, or right. Applying uniform quantization to those quantities, as in existing layout generation models [2,11,20], results in a loss of information. As a pre-processing step, we propose to apply a classical clustering algorithm, such as KMeans [29], on $x$, $y$, $w$, and $h$ independently to obtain balanced position and size tokens for each dataset. We show in Sec. 4.7 how the quantization strategy affects the resulting quality.
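A sketch of this pre-processing using scikit-learn's KMeans; this is one plausible implementation under our assumptions, not necessarily the released one:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(values, num_bins=128, seed=0):
    # values: 1D array of, e.g., all x-coordinates in the training set.
    km = KMeans(n_clusters=num_bins, random_state=seed, n_init=10)
    km.fit(np.asarray(values, dtype=np.float64).reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())   # sorted bin centroids loc(j)

def quantize(value, centroids):
    return int(np.abs(centroids - value).argmin())  # index of the nearest centroid
```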
Decoupled Positional Encoding Previous works apply standard positional encoding to a flattened sequence of layout tokens $l_{\mathrm{flat}}$ [2,11,20]. We argue that this flattening approach could lose the structure information of the layout and lead to inferior generation performance. In a layout, each token has two types of indices: the $i$-th element and the $j$-th attribute. We empirically find that independently applying positional encoding to those indices improves the final generation performance, which we study in Sec. 4.7.
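A minimal PyTorch sketch of the idea (our illustration): instead of one embedding over the flat position $5i + j$, the element index $i$ and the attribute index $j$ each get their own learned embedding, and the two are summed.

```python
import torch
import torch.nn as nn

class DecoupledPositionalEmbedding(nn.Module):
    def __init__(self, max_elements: int, num_attrs: int, dim: int):
        super().__init__()
        self.element_emb = nn.Embedding(max_elements, dim)  # which element (i)
        self.attr_emb = nn.Embedding(num_attrs, dim)         # which attribute (j): c, x, y, w, h

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        num_attrs = self.attr_emb.num_embeddings
        # element index = pos // 5, attribute index = pos % 5
        return self.element_emb(pos // num_attrs) + self.attr_emb(pos % num_attrs)
```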
3.3. Conditional Generation
We elaborate on solving various conditional layout generation tasks using a pre-trained, frozen LayoutDM. We inject conditional information into both the initial state $\mathbf{z}_T$ and the sampled states $\{\mathbf{z}_t\}_{t=0}^{T-1}$ during inference but do not modify the denoising network $p_\theta$. The actual implementation of the injection differs by the type of condition.
Strong Constraints The most typical condition is partially known layout fields. Let $\mathbf{z}_{\mathrm{known}} \in \mathcal{Z}^N$ contain the known fields and $\mathbf{m} \in \{0,1\}^N$ be a mask vector denoting the known and unknown fields as $1$ and $0$, respectively. In each timestep $t$, we sample $\hat{\mathbf{z}}_{t-1}$ from $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ in Eq. (3) and then inject the condition by $\mathbf{z}_{t-1} = \mathbf{m} \odot \mathbf{z}_{\mathrm{known}} + (\mathbf{1} - \mathbf{m}) \odot \hat{\mathbf{z}}_{t-1}$, where $\mathbf{1}$ denotes an $N$-dimensional all-ones vector and $\odot$ denotes the element-wise product.
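The masking-based injection amounts to one line per reverse step; a sketch with integer token tensors (ours):

```python
import torch

def inject_known_fields(z_hat: torch.Tensor, z_known: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # z_hat: tokens sampled from p_theta(z_{t-1} | z_t), shape (N,)
    # z_known: tokens for the known fields (arbitrary values where unknown)
    # mask: 1 where the field is given by the user, 0 otherwise
    return mask * z_known + (1 - mask) * z_hat
```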
Weak Constraints We may impose a weaker constraint during generation, such as placing an element near the center. We offer a way to impose such constraints in a unified framework without additional training or external neural network models. We propose to adjust the logits to inject weak constraints in log probability space by
$$\log \hat{p}_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \propto \log p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) + \lambda_\pi \boldsymbol{\pi}, \qquad (6)$$
where $\boldsymbol{\pi} \in \mathbb{R}^{N \times K}$ is a prior term that weights the desired outputs, and $\lambda_\pi \in \mathbb{R}$ is a hyper-parameter. The prior term can be defined either hard-coded (Refinement in Sec. 4.5) or through differentiable loss functions (Relationship in Sec. 4.5). Let $\{\mathcal{L}_i\}_{i=1}^{L}$ be a set of differentiable loss functions given the prediction; the latter prior definition can be written as:
$$\boldsymbol{\pi} = -\nabla_{p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)} \sum_{i=1}^{L} \mathcal{L}_i\bigl(p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)\bigr). \qquad (7)$$
Although the formulation of Eq. (7) resembles steering diffusion models by gradients from external models [7,25], our primary focus is incorporating classical hand-crafted energies for aesthetic principles of layout [32] that do not depend on an external model. In practice, we tune the hyper-parameters for imposing weak constraints, such as $\lambda_\pi$. Note that these hyper-parameters are used only for inference and are easier to tune than the other training hyper-parameters.
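A sketch of the logit adjustment of Eqs. (6)-(7) in PyTorch; the loss functions and weight are placeholders of ours, and each loss is assumed to return a scalar:

```python
import torch

def adjust_logits(logits: torch.Tensor, losses, weight: float = 1.0) -> torch.Tensor:
    # logits: unnormalized log p_theta(z_{t-1} | z_t), shape (N, K)
    probs = logits.softmax(dim=-1).detach().requires_grad_(True)
    total = sum(loss_fn(probs) for loss_fn in losses)   # differentiable constraint violations
    total.backward()
    prior = -probs.grad                                 # pi = -grad of the total loss (Eq. 7)
    return logits + weight * prior                      # Eq. (6); renormalized by a later softmax
```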
4. Experiment
4.1. Datasets
We use two large-scale datasets for comparison, Rico [5]
and PubLayNet [45]. As we mention in Sec. 3.2, an ele-
ment in a layout for each dataset is described by the five
attributes. For preprocessing, we set the maximum number of elements per layout $M$ to 25. If a layout contains more elements, we discard the whole layout.
We provide an overview of each dataset. Rico is a dataset
of user interface designs for mobile applications containing
25 element categories such as text button, toolbar, and icon.
We divide the dataset into 35,851 / 2,109 / 4,218 samples for
train, validation, and test splits. PubLayNet is a dataset of
research papers containing five element categories, such as
table, image, and text. We divide the dataset into 315,757 /
16,619 / 11,142 samples for train, validation, and test splits.
4.2. Evaluation Metrics
We employ two primary metrics: FID and Maximum
IoU (Max.). These metrics take into account both fidelity
and diversity [12], which are two mutually complemen-
tary properties widely used in evaluating generative mod-
els. FID [12] captures the similarity of generated data to
real ones in feature space. We employ an improved feature
extraction model for layouts [19] instead of a conventional
method [21] to compute FID. Maximum IoU [19] measures
the conditional similarity between generated and real lay-
outs. The similarity is measured by computing optimal
matching that maximizes average IoU between generated
and real layouts that have an identical set of categories. For
reference, we show the FID and Maximum IoU computed
between the validation and test data as Real data.
4.3. Tasks and Baselines
We test LayoutDM on six tasks for evaluation.
Unconditional generates layouts without any conditional
input or constraint.
Category → size + position (C → S+P) is a generation task conditioned on the category of each element [20].
Category + size → position (C+S → P) is conditioned on the category and size of each element.
Completion is conditioned on a small number of elements
whose attributes are all known. Given a complete layout,
we randomly sample from 0% to 20% of elements.
Refinement is conditioned on a noisy layout in which
only geometric information is perturbed [36]. Following
RUITE [36], we synthesize the input layout by adding ran-
dom noise to the size and position of each element. We sam-
ple the noise from a normal distribution with a mean of 0 and a variance of 0.01.
Relationship is conditioned on the category of each ele-
ment and some relationship constraints between the ele-
ments [21]. Following CLG-LO [19], we employ the size
and location relationships and randomly sample 10% rela-
tionships between elements for the experiment.
The first four tasks handle basic layout fields. We include
a few task-agnostic models for comparison using existing
controllable layout generation methods or simple adaptation
of generative models in the following:
LayoutTrans is a simple autoregressive model [11] trained on an element-level shuffled layout, following [33]. We set the generation order of the variables to c → w → h → x → y.
MaskGIT* is originally a non-autoregressive model for unconditional fixed-length data generation [4]. We use [PAD] to enable variable-length generation.
BLT is a non-autoregressive model with layout-specific de-
coding strategy [20].
BART is a denoising autoencoder that can solve both
comprehension and generation tasks based on Transformer
encoder-decoder backbone [22]. We randomly generate a
number of [MASK] tokens from a uniform distribution be-
tween one and the sequence length, and perform masking
based on the number.
VQDiffusion* is a diffusion-based model originally for text-to-image generation [9]. We adapt the model for layout using $K = C + 4B + 2$ tokens, including [PAD].
4.4. Implementation Details
We re-implement most of the models since there are few
official implementations publicly available except [11,19,
20]2. We train all the models on the two datasets with three
independent trials and report the average of the results.
LayoutDM follows VQDiffusion for hyper-parameters unless otherwise specified, such as the configuration of $p_\theta$ and the transition matrix parameters, i.e., $\alpha_t$ and $\gamma_t$. We set the loss weight $\lambda = 0.1$ (in Eq. (4)) and the number of diffusion timesteps $T = 100$. For optimization, we use AdamW [27] with a learning rate of $5.0 \times 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.98$.
Many models, including LayoutDM, use Trans-
former [40] encoder backbone. We define a shared
configuration as follows: 4 layers, 8 attention heads, 512
embedding dimensions, 2048 hidden dimensions, and 0.1
dropout rate. For other models with extra modules, we ad-
just the number of hidden dimensions to roughly match the
number of parameters for a fair comparison. We randomly
shuffle elements in the layout to avoid fixed-order gener-
ation during training. We search best hyper-parameters to
obtain the best FID using the validation set.
4.5. Quantitative Evaluation
C → S+P, C+S → P, Completion In these tasks, we inject conditions by masking. We summarize comparisons in Tab. 1. As task-specific models, we include LayoutVAE [16], NDN-none [21], and LayoutGAN++ [19] for C → S+P. We also adapt these models for C+S → P. LayoutDM outperforms other models except LayoutTrans [11] in completion. The significant performance gap between LayoutDM and VQDiffusion* suggests the contribution of our proposals beyond the simple discrete diffusion models discussed in Sec. 3.2. Results in the completion task suggest that a combination of padding and diffusion models is the primary key to the generation quality. We find that FID and Maximum IoU are not highly correlated only in the completion task. We conjecture that Maximum IoU may become unstable when categories are also predicted, unlike the C → S+P and C+S → P tasks where categories are given.
Fig. 3 shows the qualitative results of some models, including LayoutDM. We can see that LayoutDM generates
2Unfortunately, most datasets have no official train-val-test splits, and
previous approaches work on different splits and pre-processing strategies.
Furthermore, models for FID computation also vary. Thus, we cannot di-
rectly compare our results with the reported figure in the literature.
Table 1. Quantitative comparison in conditional generation given partially known fields. Top two results are highlighted in bold and underline, respectively. † indicates the results of BLT trained with [PAD] as an additional vocabulary, since the original model cannot perform unordered completion in practice.

Task                 Category → Size+Position    Category+Size → Position    Completion
Dataset              Rico         PubLayNet      Rico         PubLayNet      Rico         PubLayNet
Model                FID   Max.   FID   Max.     FID   Max.   FID   Max.     FID   Max.   FID   Max.
Task-specific models
LayoutVAE [16]       33.3  0.249  26.0  0.316    30.6  0.283  27.5  0.315    -     -      -     -
NDN-none [21]        28.4  0.158  61.1  0.162    62.8  0.219  69.4  0.222    -     -      -     -
LayoutGAN++ [19]     6.84  0.267  24.0  0.263    6.22  0.348  9.94  0.342    -     -      -     -
Task-agnostic models
LayoutTrans [11]     5.57  0.223  14.1  0.272    3.73  0.323  16.9  0.320    3.71  0.537  8.36  0.451
MaskGIT* [4]         26.1  0.262  17.2  0.319    8.05  0.320  5.86  0.380    33.5  0.533  19.7  0.484
BLT [20]             17.4  0.202  72.1  0.215    4.48  0.340  5.10  0.387    117†  0.471  131†  0.345
BART [22]            3.97  0.253  9.36  0.320    3.18  0.334  5.88  0.375    8.87  0.527  9.58  0.446
VQDiffusion* [9]     4.34  0.252  10.3  0.319    3.21  0.331  7.13  0.374    11.0  0.541  11.1  0.373
LayoutDM             3.55  0.277  7.95  0.310    2.22  0.392  4.25  0.381    9.00  0.576  7.65  0.377
Real data            1.85  0.691  6.25  0.438    1.85  0.691  6.25  0.438    1.85  0.691  6.25  0.438
Figure 3. Comparison in conditional generation given partially known fields. Columns show the condition, LayoutTrans [11], BLT [20], BART [22], LayoutDM, and real layouts on Rico and PubLayNet; rows correspond to the C → S+P, C+S → P, and Completion tasks.
high-quality layouts with fewer layout aesthetics violations,
such as misalignment and overlap, given diverse conditions.
Unconditional Generation Tab. 2 summarizes the results of unconditional generation. Unconditional layout generation methods often assume a fixed order for element generation, e.g., top-to-bottom, rather than a random order, to obtain better generation quality by constraining the prediction. For reference, we additionally report the results of LayoutTrans [11] trained on the fixed element order (LayoutTrans-fixed). Although we design LayoutDM primarily for conditional generation, LayoutDM achieves the best FID under the random element order setting. We conjecture that BLT's poor performance is due to a train-test mask distribution inconsistency caused by their hierarchical masking strategy for training. BLT masks a randomly sampled number of fields from a single semantic group, i.e., category, position, or size. However, decoding starts with all tokens masked at inference time. The alignment metric of Real data stays at 0.109 in Rico. Very small alignment values of LayoutTrans and MaskGIT can be a signal of producing trivial outputs in Rico.
Table 2. Quantitative comparison in unconditional generation. Top two results are highlighted in bold and underline, respectively.

Dataset                   Rico            PubLayNet
Model                     FID    Align.   FID    Align.
LayoutTrans-fixed [11]    6.47   0.133    17.1   0.084
LayoutTrans [11]          7.63   0.068    13.9   0.127
MaskGIT* [4]              52.1   0.015    27.1   0.101
BLT [20]                  88.2   1.030    116    0.153
BART [22]                 11.9   0.090    16.6   0.116
VQDiffusion* [9]          7.46   0.178    15.4   0.193
LayoutDM                  6.65   0.162    13.9   0.195
Real data                 1.85   0.109    6.25   0.0214
Table 3. Quantitative comparison in the refinement task. Top two results are highlighted in bold and underline, respectively.

Dataset                Rico                    PubLayNet
Model                  FID    Max.   Sim       FID    Max.   Sim
Task-specific models
RUITE [36]             3.23   0.421  0.221     6.39   0.415  0.174
Task-agnostic models
Noisy input            134    0.213  0.177     130    0.242  0.147
LayoutDM               2.77   0.370  0.205     6.75   0.352  0.149
  w/o logit adj.       3.55   0.277  0.168     7.95   0.310  0.127
Real data              1.85   0.691  0.260     6.25   0.438  0.216
Figure 4. Qualitative comparison in the refinement task (columns: Input, RUITE [36], LayoutDM, Real; rows: Rico and PubLayNet).
Refinement Our LayoutDM performs this task with a combination of the strong constraints of element categories, i.e., setting $\mathbf{z}_{\mathrm{known}} = \{(c_1, \text{[MASK]}, \ldots, \text{[MASK]}), \ldots\}$, and the weak constraints that geometric outputs should appear near the noisy inputs. As an example of the weak constraint, we describe a constraint that imposes the x-coordinate estimate of the $i$-th element to be close to the noisy continuous observation $\hat{x}_i$.
Figure 5. Quality-violation trade-off in the relationship task (FID vs. violation rate on Rico and PubLayNet, comparing LayoutGAN++ w/ CLG-LO, LayoutDM, LayoutDM w/o logit adjustment, and NDN-partial). Lower scores indicate better performance for both metrics.

Figure 6. Qualitative comparison in the relationship task (columns: Relationship, w/o rel., w/ rel., Real; rows: Rico and PubLayNet).
We denote a sliced vector of the prior term $\boldsymbol{\pi}$ in Eq. (6) that corresponds to the x-coordinate of the $i$-th element as $\boldsymbol{\pi}^i_x \in \mathbb{R}^K$ and define it by:
$$\pi^i_{x_j} = \begin{cases} 1 & \text{if } |\mathrm{loc}(j) - \hat{x}_i| < m \text{ and } j \in \mathcal{X}, \\ 0 & \text{otherwise,} \end{cases} \qquad (8)$$
where $m$ is a hyper-parameter indicating a margin, $\mathcal{X}$ is a set of indices denoting tokens for $x$ in the vocabularies, and $\mathrm{loc}(j)$ is a function that returns the centroid value of the $j$-th token in the vocabularies. We define similar constraints for the other geometric variables and elements.
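A sketch of how the hard-coded prior of Eq. (8) could be built for the x-coordinate of one element; the centroid array and margin value are placeholders of ours (the centroids would come from the KMeans quantizer):

```python
import numpy as np

def x_prior(noisy_x: float, centroids: np.ndarray, margin: float = 0.05) -> np.ndarray:
    # centroids: loc(j) for every x-token j in the vocabulary.
    # Returns a 0/1 prior over those tokens: 1 for x-tokens whose centroid lies
    # within `margin` of the noisy observation, 0 elsewhere (Eq. 8).
    return (np.abs(centroids - noisy_x) < margin).astype(np.float64)
```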
We summarize the performance in Tab. 3. We addition-
ally report DocSim [34] (Sim) to measure the similarity of a
predicted and its corresponding ground truth layout. Impos-
ing noisy geometric fields as a weak prior significantly im-
proves the masking-only model and makes the performance
much closer to RUITE [36], which is a denoising model not
applicable to other layout tasks. We compare some results
in Fig. 4. Both LayoutDM and RUITE successfully recover
complete layouts from non-trivially noisy layouts.
Relationship We use Eq. (7) to incorporate the relational
constraints during the sampling step of LayoutDM. We fol-
low [19] to employ the loss functions penalizing size and
Figure 7. Speed-quality trade-off (time per sample [ms] vs. FID on PubLayNet) of different models for C+S → P, comparing LayoutVAE, NDN-none, LayoutGAN++, LayoutTrans, BART, MaskGIT*, BLT, VQDiffusion*, LayoutDM, and Real Data.
Table 4. Ablation study results on layout-specific modifications in unconditional generation on the Rico [5] dataset.

                               FID    Align.
LayoutDM                       6.65   0.162
  w/o modality-wise diff.      7.32   0.156
  w/o decoupled pos. enc.      6.78   0.227
  w/ uniform-quantization      7.58   0.256
  w/ percentile-quantization   9.79   0.232
Real Data                      1.85   0.109
location relationships between elements that do not match user specifications. We define the loss functions for continuous bounding boxes, so we have to convert the predicted discrete bounding boxes to continuous ones in a differentiable manner. Given the estimated probabilities of the discrete x-coordinate $p(x)$, for example, we compute the continuous x-coordinate $\bar{x}$ by $\bar{x} = \sum_{n \in \mathcal{X}} p(x = n)\, \mathrm{loc}(n)$. A similar conversion applies to the other attributes. Empirically, we find that applying the logit adjustment multiple times (three times in our experiments) at each diffusion step moderately improves performance.
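The differentiable discrete-to-continuous conversion is simply an expectation over bin centroids; a PyTorch sketch (ours):

```python
import torch

def expected_coordinate(prob: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    # prob: predicted probabilities over the B x-coordinate tokens, shape (..., B)
    # centroids: loc(n), the centroid of each token, shape (B,)
    # Returns the differentiable continuous estimate x_bar = sum_n p(x = n) * loc(n).
    return (prob * centroids).sum(dim=-1)
```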
We compare LayoutDM with two task-specific ap-
proaches: NDN-partial [21] and CLG-LO based on Lay-
outGAN++ [19]. We show the results in Fig. 5. We ad-
ditionally report constraint violation error rates [19]. Lay-
outDM can control the strength of the logit adjustment as
in Eq. (6) and produces an FID-violation trade-off curve.
LayoutDM is comparable to NDN-partial in Rico and out-
performs NDN-partial by a large margin in PubLayNet. Al-
though LayoutDM is inferior to CLG-LO in both datasets,
note that the average runtime of CLG-LO is 4.0s, which is
much slower than 0.5s in LayoutDM. We show some results
of LayoutDM in Fig. 6.
4.6. Speed-Quality Trade-off
Runtime is also essential for controllable generation. We show a speed-quality trade-off curve for C+S → P in Fig. 7. The Transformer encoder-only models, such as LayoutDM and BLT, can achieve fast generation at some sacrifice of quality. We employ the fast-sampling strategy used in discrete diffusion models [3] for LayoutDM by $p_\theta(\mathbf{z}_{t-\Delta} \mid \mathbf{z}_t) \propto \sum_{\tilde{\mathbf{z}}_0} q(\mathbf{z}_{t-\Delta}, \mathbf{z}_t \mid \tilde{\mathbf{z}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t)$, where $\Delta \in \mathbb{N}$ indicates a step size for generation in $T / \Delta$ steps. Despite being a task-agnostic model, LayoutDM achieves the best quality-speed trade-off except for the task-specific LayoutGAN++ [19], which runs under 10 ms.
4.7. Ablation Study
We investigate whether the techniques in Sec. 3.2 improve the performance. First, we evaluate the choice of quantization methods for the geometric fields of elements. Instead of KMeans, we compute centroids for the quantization by:
Uniform: dataset-agnostic quantization, which is popular in previous works [2,11,20]. Following [11], we choose $\{0.0, \frac{1}{B}, \ldots, \frac{B-1}{B}\}$ and $\{\frac{1}{B}, \ldots, \frac{B-1}{B}, 1.0\}$ for the position and size, respectively.
Percentile: we sort the data into equally sized groups and take the average value of each group as the centroids. This is dataset-specific quantization similar to KMeans.
We show the results at the bottom of Tab. 4. We additionally report the Alignment metric (Align.) used in [19], since the choice of quantization affects the alignment between elements. Compared to Uniform and Percentile, KMeans quantization significantly improves both FID and Alignment.
We confirm that our modality-wise diffusion and decoupled positional encoding both moderately improve the performance, as we show in the top half of Tab. 4.
5. Discussion
LayoutDM is based on diffusion models for discrete
state-space. Using continuous state-space as in latent dif-
fusion models [37] would be interesting. Extension of
LayoutDM to handle various layout properties such as
color [17] and image/text [41] is also appealing.
We believe our proposed logit adjustment can incorpo-
rate more attributes. Attribute-conditional LayoutGAN [24]
considers area, aspect ratio, and reading order of elements
for fine-grained control. Since these attributes can be easily
converted to the size and location relationship constraints,
incorporating them with our LayoutDM is not very difficult.
Potential negative impact Our model might be used to
automatically generate the basic structure of fake websites
or mobile applications, which could lead to scams or the
spreading of misinformation.
References
[1] Maneesh Agrawala, Wilmot Li, and Floraine Berthouzoz.
Design principles for visual communication. Communica-
tions of the ACM, 54(4), 2011. 2
[2] Diego Martin Arroyo, Janis Postels, and Federico Tombari.
Variational transformer networks for layout generation. In
CVPR, 2021. 1,2,3,4,8
[3] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tar-
low, and Rianne van den Berg. Structured denoising diffu-
sion models in discrete state-spaces. In NeurIPS, 2021. 1,2,
3,8
[4] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T
Freeman. MaskGIT: Masked generative image transformer.
In CVPR, 2022. 5,6,7
[5] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hib-
schman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ran-
jitha Kumar. Rico: A mobile app dataset for building data-
driven design applications. In UIST, 2017. 1,2,4,8
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional trans-
formers for language understanding. In NAACL, 2019. 2
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. In NeurIPS, 2021. 2,4
[8] Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale
Song. DOC2PPT: Automatic presentation slides generation
from scientific documents. In AAAI, 2022. 2
[9] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo
Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec-
tor quantized diffusion model for text-to-image synthesis. In
CVPR, 2022. 1,2,3,5,6,7
[10] Shunan Guo, Zhuochen Jin, Fuling Sun, Jingwen Li, Zhaorui
Li, Yang Shi, and Nan Cao. Vinci: an intelligent graphic
design system for generating advertising posters. In CHI,
2021. 2
[11] Kamal Gupta, Alessandro Achille, Justin Lazarow, Larry
Davis, Vijay Mahadevan, and Abhinav Shrivastava. Layout-
Transformer: Layout generation and completion with self-
attention. In ICCV, 2021. 1,2,3,4,5,6,7,8
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilib-
rium. In NeurIPS, 2017. 5
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
sion probabilistic models. In NeurIPS, 2020. 1,2
[14] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. In NeurIPS, 2021. 1,2
[15] Zhaoyun Jiang, Shizhao Sun, Jihua Zhu, Jian-Guang Lou,
and Dongmei Zhang. Coarse-to-fine generative modeling for
graphic layouts. In AAAI, 2022. 2
[16] Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Si-
gal, and Greg Mori. LayoutVAE: Stochastic scene layout
generation from a label set. In CVPR, 2019. 2,5,6
[17] Kotaro Kikuchi, Naoto Inoue, Mayu Otani, Edgar Simo-
Serra, and Kota Yamaguchi. Generative colorization of struc-
tured mobile web pages. In WACV, 2023. 8
[18] Kotaro Kikuchi, Mayu Otani, Kota Yamaguchi, and Edgar
Simo-Serra. Modeling visual containment for web page lay-
out optimization. Computer Graphics Forum, 40(7), 2021.
2
[19] Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota
Yamaguchi. Constrained graphic layout generation via latent
optimization. In ACM MM, 2021. 1,2,5,6,7,8
[20] Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan
Hao, Haifeng Gong, and Irfan Essa. BLT: Bidirectional lay-
out transformer for controllable layout generation. In ECCV,
2022. 1,2,3,4,5,6,7,8
[21] Hsin-Ying Lee, Weilong Yang, Lu Jiang, Madison Le, Ir-
fan Essa, Haifeng Gong, and Ming-Hsuan Yang. Neural de-
sign network: Graphic layout generation with constraints. In
ECCV, 2020. 1,2,5,6,8
[22] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvinine-
jad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and
Luke Zettlemoyer. BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and
comprehension. In ACL, 2020. 5,6,7
[23] Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang,
and Tingfa Xu. LayoutGAN: Generating graphic layouts
with wireframe discriminators. In ICLR, 2019. 2
[24] Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu,
Christina Wang, and Tingfa Xu. Attribute-conditioned lay-
out gan for automatic graphic design. IEEE TVCG, 27(10),
2020. 8
[25] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and
Joshua B Tenenbaum. Compositional visual generation with
composable diffusion models. In ECCV, 2022. 4
[26] Simon Lok and Steven Feiner. A survey of automated layout
techniques for information presentations. In SmartGraphics,
2001. 2
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
regularization. In ICLR, 2019. 5
[28] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher
Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting
using denoising diffusion probabilistic models. In CVPR,
2022. 2
[29] J MacQueen. Classification and analysis of multivariate ob-
servations. In 5th Berkeley Symp. Math. Statist. Probability,
1967. 4
[30] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-
jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided
image synthesis and editing with stochastic differential equa-
tions. In ICLR, 2022. 2
[31] Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala,
and Vladlen Koltun. Interactive furniture layout using inte-
rior design guidelines. ACM TOG, 30(4), 2011. 2
[32] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. IEEE TVCG, 20(8), 2014. 2,4
[33] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten
Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregres-
sive transformers for indoor scene synthesis. In NeurIPS,
2021. 2,5
[34] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar
Averbuch-Elor. READ: Recursive autoencoders for docu-
ment layout generation. In CVPRW, 2020. 7
[35] Chunyao Qian, Shizhao Sun, Weiwei Cui, Jian-Guang Lou,
Haidong Zhang, and Dongmei Zhang. Retrieve-then-adapt:
Example-based automatic generation for proportion-related
infographics. IEEE TVCG, 27(2), 2020. 2
[36] Soliha Rahman, Vinoth Pandian Sermuga Pandian, and
Matthias Jarke. RUITE: Refining ui layout aesthetics using
transformer encoder. In 26th International Conference on
Intelligent User Interfaces-Companion, 2021. 2,5,7
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 8
[38] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, 2015. 2,3
[39] Yang Song and Stefano Ermon. Improved techniques for
training score-based generative models. In NeurIPS, 2020.
2
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017. 3,
5
[41] Kota Yamaguchi. CanvasVAE: Learning to generate vector
graphics documents. In ICCV, 2021. 2,8
[42] Xuyong Yang, Tao Mei, Ying-Qing Xu, Yong Rui, and
Shipeng Li. Automatic generation of visual-textual presen-
tation layout. TOMM, 12(2), 2016. 2
[43] Lap Fai Yu, Sai Kit Yeung, Chi Keung Tang, Demetri Ter-
zopoulos, Tony F Chan, and Stanley J Osher. Make it home:
automatic optimization of furniture arrangement. ACM TOG,
30(4), 2011. 2
[44] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH
Lau. Content-aware generative modeling of graphic design
layouts. ACM TOG, 38(4), 2019. 2
[45] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Pub-
LayNet: largest dataset ever for document layout analysis. In
ICDAR, 2019. 1,2,4
10176
... There are a few recent works adopting diffusion models or iterative denoising strategies for 3D scene synthesis problems [17,13,46,41]. The closest work to ours is DiffuScene [41], which generates 3D indoor scenes without floor plan constraints using DDPM by first converting semantic labels to one-hot encoding vectors in the continuous domain and then using argmax to retrieve label predictions. ...
... Diffusion Models for Layout Synthesis. Very recently, diffusion models became popular for layout synthesis, including document layout generation [17,13], 3D scene synthesis [41], furniture re-arrangement [46,41], graph-conditioned 3D layout generation [52,24], and text conditioned scene synthesis [7]. LayoutDM [17] applies diffusion models in discrete state space. ...
... Very recently, diffusion models became popular for layout synthesis, including document layout generation [17,13], 3D scene synthesis [41], furniture re-arrangement [46,41], graph-conditioned 3D layout generation [52,24], and text conditioned scene synthesis [7]. LayoutDM [17] applies diffusion models in discrete state space. After discretizing position and size to a fixed number of bins, they use the discrete corruption process by VQ-Diffusion [11] to train the reverse transformer network for predicting category, position, and size for document layout generation. ...
Preprint
Full-text available
Realistic conditional 3D scene synthesis significantly enhances and accelerates the creation of virtual environments, which can also provide extensive training data for computer vision and robotics research among other applications. Diffusion models have shown great performance in related applications, e.g., making precise arrangements of unordered sets. However, these models have not been fully explored in floor-conditioned scene synthesis problems. We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture, designed to synthesize plausible 3D indoor scenes from given room types, floor plans, and potentially pre-existing objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation. Our approach uniquely implements structured corruption across the mixed discrete semantic and continuous geometric domains, resulting in a better conditioned problem for the reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. In addition, our models can handle partial object constraints via a corruption-and-masking strategy without task specific training. We show MiDiffusion maintains clear advantages over existing approaches in scene completion and furniture arrangement experiments.
... City scene generation combines detailed urban planning, including road networks, land use, and building placement, using techniques from rule-based designs [8,10] to procedural tools like CityEngine [2] and Unreal Engine [5], and deep learning [43,79,61]. While diffusion models [29,36] often simplify layouts to basic elements, limiting complexity, Neural Radiance Fields (NeRF) [43,79,20] produce high-quality visuals but are computationally expensive. In contrast, CityGen [23] combines Stable Diffusion [63] with Low-Rank Adaptation (LoRA) [34] on the MatrixCity dataset [41], resulting in more realistic and controllable city scenes through ControlNet [86]. ...
... We compare the performance of layout generation with other models that generate city layouts as semantic masks, including Infinicity [23], CityDreamer [79], and CityGen [23]. We do not include comparisons with other methods that use object bounding boxes to represent city layouts, such as [59,37,36]. These methods typically limit object types to only buildings and roads and impose constraints on object shapes. ...
Preprint
Full-text available
City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model(LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1)CityCraft-OSM dataset including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations. 2) CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.
... With the advance in deep learning, researchers are glad to embrace data-driven methods [2,12,15,[17][18][19] in layout generation. Most of these works focus on adopting the latest generative architecture but overlook the necessary conditional requirements for layout. ...
... LayoutGAN [21] employs the GAN (Generative Adversarial Network) paradigm and designs a differentiable rendering process for connecting the visual and graphic domains. LayoutVAE [17] and CanvasVAE [39] adopt the VAE (Variational Auto-Encoder) paradigm, while more recent works adopt the auto-regressive architecture [2,12,19] or the diffusion architecture [14,15,44]. Despite their achievement on unconditioned layout generation tasks, they are hard to use in real-world scenarios. ...
Preprint
Full-text available
Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic layout generation, leveraging the multi-modal large language model (MLLM) to accommodate diverse design tasks. In contrast, our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing existing datasets' limitations in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated poster), further validating our model's utility in real-life settings. Marking by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. The code and datasets will be publicly available on https://github.com/posterllava/PosterLLaVA.
... Recent studies utilize sequential models like Transformers [31] to generate layout elements as sequences [10,2,16,15]. LayoutDM and PLay [13,8] demonstrate conditional layout generation capabilities using diffusion models. Two recent works [21,20] leverages LLMs for few shot layout generation, but they either require specific selection and ranking process for examples or only use LLMs to parse the input. ...
Preprint
Layout design, such as user interface or graphical layout in general, is fundamentally an iterative revision process. Through revising a design repeatedly, the designer converges on an ideal layout. In this paper, we investigate how revision edits from human designer can benefit a multimodal generative model. To do so, we curate an expert dataset that traces how human designers iteratively edit and improve a layout generation with a prompted language goal. Based on such data, we explore various supervised fine-tuning task setups on top of a Gemini multimodal backbone, a large multimodal model. Our results show that human revision plays a critical role in iterative layout refinement. While being noisy, expert revision edits lead our model to a surprisingly strong design FID score ~10 which is close to human performance (~6). In contrast, self-revisions that fully rely on model's own judgement, lead to an echo chamber that prevents iterative improvement, and sometimes leads to generative degradation. Fortunately, we found that providing human guidance plays at early stage plays a critical role in final generation. In such human-in-the-loop scenario, our work paves the way for iterative design revision based on pre-trained large multimodal models.
... The field of document generation presents unique challenges in seamlessly integrating visual elements such as style, layout, and multimedia with textual content, posing new problems for the vision community. * Work done during internship at Adobe Research Document layout generation [1,7,9,10,14,16,19] has played a crucial role in numerous applications, ranging from automated report creation to dynamic webpage design, significantly impacting how information is perceived and interacted with by users. With large language models (LLMs) [3,27] becoming more and more capable of compositional reasoning of visual concepts [6], it opens further avenues for exploiting autoregressive approaches in the automatic end-to-end generation of both document content and layout structure. ...
Preprint
Full-text available
While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks.
... Diffusion Models for Discrete Data. Several approaches to discrete generation using diffusion models have been developed [4,28,53,17]. For graph generation specifically, [50] utilize a Markov process that progressively edits graphs by adding or removing edges and altering node or edge categories and is trained using a graph transformer network, that reverses this process to predict the original graph structure from its noisy version. ...
Preprint
We present a formulation of flow matching as variational inference, which we refer to as variational flow matching (VFM). Based on this formulation we develop CatFlow, a flow matching method for categorical data. CatFlow is easy to implement, computationally efficient, and achieves strong results on graph generation tasks. In VFM, the objective is to approximate the posterior probability path, which is a distribution over possible end points of a trajectory. We show that VFM admits both the CatFlow objective and the original flow matching objective as special cases. We also relate VFM to score-based models, in which the dynamics are stochastic rather than deterministic, and derive a bound on the model likelihood based on a reweighted VFM objective. We evaluate CatFlow on one abstract graph generation task and two molecular generation tasks. In all cases, CatFlow exceeds or matches performance of the current state-of-the-art models.
... The AIGC technology is widely applied in the task of generating advertising images. In the previous works, creative generation methods in the advertising scene use deep learning to generate some objects/tags [25,38], dense captions [6] or layout information [12,29] on image. CG4CTR [25] use use diffusion model to generate background images while keeping the main product information unchanged in creative generation task for the advertising scene. ...
Preprint
Artificial Intelligence Generated Content(AIGC), known for its superior visual results, represents a promising mitigation method for high-cost advertising applications. Numerous approaches have been developed to manipulate generated content under different conditions. However, a crucial limitation lies in the accurate description of products in advertising applications. Applying previous methods directly may lead to considerable distortion and deformation of advertised products, primarily due to oversimplified content control conditions. Hence, in this work, we propose a patch-enhanced mask encoder approach to ensure accurate product descriptions while preserving diverse backgrounds. Our approach consists of three components Patch Flexible Visibility, Mask Encoder Prompt Adapter and an image Foundation Model. Patch Flexible Visibility is used for generating a more reasonable background image. Mask Encoder Prompt Adapter enables region-controlled fusion. We also conduct an analysis of the structure and operational mechanisms of the Generation Module. Experimental results show our method can achieve the highest visual results and FID scores compared with other methods.
... Recent studies on layout generation use sequential models like Transformers (Vaswani et al., 2017) to output the layout elements as sequences (Gupta et al., 2021;Arroyo et al., 2021;Kong et al., 2022;Kikuchi et al., 2021). LayoutDM and PLay (Inoue et al., 2023;Cheng et al., 2023) show results in conditional layout generation. We choose PLay as our backbone model based on its flexibility to inject various conditions using latent diffusion (Rombach et al., 2022a) and classifier-free guidance (Ho & Salimans, 2022). ...
Preprint
Learning from human feedback has shown success in aligning large, pretrained models with human values. Prior works have mostly focused on learning from high-level labels, such as preferences between pairs of model outputs. On the other hand, many domains could benefit from more involved, detailed feedback, such as revisions, explanations, and reasoning of human users. Our work proposes using nuanced feedback through the form of human revisions for stronger alignment. In this paper, we ask expert designers to fix layouts generated from a generative layout model that is pretrained on a large-scale dataset of mobile screens. Then, we train a reward model based on how human designers revise these generated layouts. With the learned reward model, we optimize our model with reinforcement learning from human feedback (RLHF). Our method, Revision-Aware Reward Models, allows a generative text-to-layout model to produce more modern, designer-aligned layouts, showing the potential for utilizing human revisions and stronger forms of feedback in improving generative models.
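A hedged sketch of the two ingredients described here: a reward model trained so that the designer-revised layout scores above the raw generation (a Bradley-Terry-style pairwise loss), and a simple policy-gradient update that uses the learned reward. Both `reward_model` and `policy.sample_with_logprob` are assumed interfaces, not the paper's code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, generated, revised):
    """Pairwise objective: the human-revised layout should score higher
    than the raw generated layout."""
    return -F.logsigmoid(reward_model(revised) - reward_model(generated)).mean()

def reinforce_step(policy, reward_model, cond):
    """Toy RLHF-style update: sample layouts, score them with the learned reward,
    and push up the log-probability of high-reward samples (no baseline or KL
    regularization, for brevity)."""
    layouts, logp = policy.sample_with_logprob(cond)   # assumed helper
    with torch.no_grad():
        reward = reward_model(layouts)
    return -(reward * logp).mean()
```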
Preprint
Automatic generation of graphic designs has recently received considerable attention. However, the state-of-the-art approaches are complex and rely on proprietary datasets, which creates reproducibility barriers. In this paper, we propose an open framework for automatic graphic design called OpenCOLE, where we build a modified version of the pioneering COLE and train our model exclusively on publicly available datasets. Based on GPT4V evaluations, our model shows promising performance comparable to the original COLE. We release the pipeline and training results to encourage open development.
Preprint
A poster from a long input document can be considered as a one-page, easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a rarely studied but challenging task. It involves content summarization of the input document, followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground-truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity, and alignment of text and images. Then, we use an LLM-based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.
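While the paper learns a deep submodular function, selection under such an objective is typically done with the standard greedy procedure sketched below (our illustration; `score_fn` stands in for the learned coverage/diversity/alignment objective).

```python
def greedy_submodular_select(candidates, score_fn, budget):
    """Standard greedy maximization of a monotone submodular objective:
    repeatedly add the candidate with the largest marginal gain.
    `score_fn(subset)` is assumed to return the (learned) subset score."""
    selected, remaining = [], list(candidates)
    while len(selected) < budget and remaining:
        base = score_fn(selected)
        gains = [score_fn(selected + [c]) - base for c in remaining]
        best = max(range(len(remaining)), key=gains.__getitem__)
        if gains[best] <= 0:          # no candidate improves the objective
            break
        selected.append(remaining.pop(best))
    return selected
```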
Chapter
Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time compared to the layout transformer baseline. Code is released at https://shawnkx.github.io/blt. Keywords: Design, Layout creation, Transformer, Non-autoregressive
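A rough sketch of the mask-predict decoding loop described above, under our own assumptions about the interface (`model(tokens)` returns per-position logits); in practice, user-specified attributes would be protected from re-masking.

```python
import torch

@torch.no_grad()
def iterative_refine(model, tokens, mask_id, rounds=4, remask_ratio=0.3):
    """Non-autoregressive decoding sketch: fill all masked attribute tokens in
    parallel, then repeatedly re-mask the least confident positions and predict
    them again. `tokens` is a 1D LongTensor containing mask_id at unknown slots."""
    for r in range(rounds):
        conf, pred = model(tokens).softmax(-1).max(-1)
        tokens = torch.where(tokens == mask_id, pred, tokens)   # fill masked slots
        if r < rounds - 1:
            k = int(remask_ratio * tokens.numel())
            tokens[conf.argsort()[:k]] = mask_id                # re-mask low-confidence slots
    return tokens
```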
Chapter
Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Keywords: Compositionality, Diffusion models, Energy-based models, Visual generation
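The conjunction of concepts described here amounts to summing the guidance directions of the individual models; a minimal sketch (assuming a shared noise-prediction model with a condition argument) is:

```python
import torch

def composed_eps(model, x_t, t, conds, null_cond, w=1.0):
    """Compose concepts by adding, to the unconditional prediction, the guidance
    direction contributed by each condition (energy-based view of diffusion)."""
    eps_uncond = model(x_t, t, null_cond)
    eps = eps_uncond.clone()
    for c in conds:
        eps = eps + w * (model(x_t, t, c) - eps_uncond)
    return eps
```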
Article
Even though graphic layout generation has attracted growing attention recently, it is still challenging to synthesize realistic and diverse layouts, due to the complicated element relationships and varied element arrangements. In this work, we seek to improve the performance of layout generation by incorporating the concept of regions, which consist of a smaller number of elements and appear like simple layouts, into the generation process. Specifically, we leverage a Variational Autoencoder (VAE) as the overall architecture and decompose the decoding process into two stages. The first stage predicts representations for regions, and the second stage fills in the detailed position for each element within the region based on the predicted region representation. Compared to prior studies that merely abstract the layout into a list of elements and generate all the element positions in one go, our approach has at least two advantages. First, by the two-stage decoding, our approach decouples the complex layout generation task into several simple layout generation tasks, which reduces the problem difficulty. Second, the predicted regions can help the model roughly know what the graphic layout looks like and serve as global context to improve the generation of detailed element positions. Qualitative and quantitative experiments demonstrate that our approach significantly outperforms the existing methods, especially on complex graphic layouts.
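A toy sketch of the two-stage decoder, with illustrative dimensions and interfaces of our own choosing (the actual model handles variable numbers of regions and elements):

```python
import torch
import torch.nn as nn

class TwoStageLayoutDecoder(nn.Module):
    """Stage 1 maps the latent code to region representations; stage 2 predicts
    element boxes conditioned on the region each element belongs to."""
    def __init__(self, z_dim=64, n_regions=4, region_dim=32, elems_per_region=8):
        super().__init__()
        self.n_regions, self.region_dim, self.elems = n_regions, region_dim, elems_per_region
        self.region_head = nn.Linear(z_dim, n_regions * region_dim)      # stage 1
        self.element_head = nn.Linear(region_dim, elems_per_region * 4)  # stage 2: (x, y, w, h)

    def forward(self, z):                                                # z: (batch, z_dim)
        regions = self.region_head(z).view(-1, self.n_regions, self.region_dim)
        boxes = self.element_head(regions).view(-1, self.n_regions, self.elems, 4)
        return regions, boxes.sigmoid()                                  # normalized coordinates
```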
Article
Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, and slide structure and layout prediction to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.
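As a hedged illustration of hierarchical encoding, the sketch below first encodes each sentence and then contextualizes the sentence vectors at the document level before slide/layout decoding; the GRU choice and dimensions are our own placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Toy two-level encoder: a sentence-level encoder produces one vector per
    sentence, and a document-level encoder contextualizes them."""
    def __init__(self, vocab, emb=128, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.sent_enc = nn.GRU(emb, hid, batch_first=True)
        self.doc_enc = nn.GRU(hid, hid, batch_first=True)

    def forward(self, doc_tokens):                  # (batch, n_sents, n_words)
        b, s, w = doc_tokens.shape
        x = self.embed(doc_tokens).view(b * s, w, -1)
        _, h = self.sent_enc(x)                     # (1, b*s, hid): last hidden per sentence
        sent_vecs = h.squeeze(0).view(b, s, -1)
        doc_ctx, _ = self.doc_enc(sent_vecs)        # (b, n_sents, hid)
        return doc_ctx                              # fed to a slide/layout decoder
```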