LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
Naoto Inoue1  Kotaro Kikuchi1  Edgar Simo-Serra2  Mayu Otani1  Kota Yamaguchi1
1CyberAgent, Japan  2Waseda University, Japan
{inoue_naoto, kikuchi_kotaro_xa}@cyberagent.co.jp  ess@waseda.jp
{otani_mayu, yamaguchi_kota}@cyberagent.co.jp
Abstract
Controllable layout generation aims at synthesizing plausible arrangements of element bounding boxes with optional constraints, such as the type or position of a specific element. In this work, we try to solve a broad range of layout generation tasks in a single model that is based on discrete state-space diffusion models. Our model, named LayoutDM, naturally handles the structured layout data in the discrete representation and learns to progressively infer a noiseless layout from the initial input, where we model the layout corruption process by modality-wise discrete diffusion. For conditional generation, we propose to inject layout constraints in the form of masking or logit adjustment during inference. We show in the experiments that our LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks.1
1. Introduction
Graphic layouts play a critical role in visual communication. Automatically creating a visually pleasing layout has tremendous application benefits that range from authoring of printed media [45] to designing application user interfaces [5], and there has been growing research interest in the community. The task of layout generation considers the arrangement of elements, where each element has a tuple of attributes, such as category, position, or size, and depending on the task setup, there could be optional control inputs that specify part of the elements or attributes. Due to the structured nature of layout data, it is crucial to consider relationships between elements during generation. For this reason, current generation approaches either build an autoregressive model [2,11] or develop a dedicated inference strategy to explicitly consider relationships [19-21].
1 Please find the code and models at: https://cyberagentailab.github.io/layout-dm.

Figure 1. Overview of LayoutDM. Top: LayoutDM is trained to gradually generate a complete layout from a blank state in discrete state space. Bottom: During sampling, we can steer LayoutDM to perform various conditional generation tasks without additional training or external models.

In this paper, we propose to utilize discrete state-space
diffusion models [3,9,14] for layout generation tasks. Dif-
fusion models have shown promising performance for var-
ious generation tasks, including images and texts [13].
We formulate the diffusion process for layout structure by
modality-wise discrete diffusion, and train a denoising back-
bone network to progressively infer the complete layout
with or without conditional inputs. To support variable-
length layout data, we extend the discrete state-space with
a special PAD token instead of the typical end-of-sequence
token used in autoregressive models. Our model can incor-
porate complex layout constraints via logit adjustment, so
that we can refine an existing layout or impose relative size
constraints between elements without additional training.
We discuss two key advantages of LayoutDM over ex-
isting models for conditional layout generation. Our model
avoids the immutable dependency chain issue [20] that hap-
pens in autoregressive models [11]. Autoregressive mod-
els fail to perform conditional generation when the con-
dition disagrees with the pre-defined generation order of
elements and attributes. Unlike non-autoregressive mod-
els [20], our model can generate variable-length elements.
We empirically show in Sec. 4.5 that naively extending
non-autoregressive models by padding results in suboptimal
variable length generation while padding combined with
our diffusion formulation leads to significant improvement.
We evaluate LayoutDM on various layout generation
tasks tackled by previous works [20,21,33,36] using two
large-scale datasets, Rico [5] and PubLayNet [45]. Lay-
outDM outperforms task-agnostic baselines in the major-
ity of cases and shows promising performance compared
with task-specific baselines. We further conduct an ablation
study to prove the significant impact of our design choices
in LayoutDM, including quantization of continuous vari-
ables and positional embedding.
We summarize our contributions as follows:
We formulate the discrete diffusion process for layout
generation and propose a modality-wise diffusion and a
padding approach to model highly structured layout data.
We propose to inject complex layout constraints via
masking and logit adjustment during the inference, so that
our model can solve diverse tasks in a single model.
We empirically show solid performance for various con-
ditional layout generation tasks on public datasets.
2. Related Work
2.1. Layout Generation
Studies on automatic layout generation have appeared
several times in literature [1,26,32,42]. Layout tasks are
commonly observed in design applications, including mag-
azine covers, posters, presentation slides, application user
interface, or banner advertising [5,8,10,18,35,41,42,44].
Recent approaches to layout generation consider both un-
conditional generation [2,11,15,16] and conditional gener-
ation in various setups, such as conditional inputs of cate-
gory or size [19-21,23], relational constraints [19,21], el-
ement completion [11], and refinement [36]. Some attempt
at solving multiple tasks in a single model [20,33].
BLT [20] points out that the recent autoregressive de-
coders [2,11] are not fully capable of considering partial
inputs, i.e. known elements or attributes, during generation
because they have a fixed generation order. BLT addresses
the conditional generation by fill-in-the-blank task formu-
lation using a bidirectional Transformer encoder similar to
masked language models [6]. However, BLT cannot solve
layout completion demonstrated in the decoder-based mod-
els because of the requirement of the known number of el-
ements. Our LayoutDM enjoys the best of both worlds and
supports a broader range of conditional generation tasks in
a single model.
Another layout-specific consideration is the complex
user-specified constraints, such as the positional require-
ments between two boxes (e.g., a header box should be on
top of a paragraph box). Earlier approaches [31,32,43]
propose hand-crafted cost functions representing the vio-
lation degree of aesthetic constraints so that those con-
straints guide the optimization process of layout inference.
CLG-LO [19] proposes an aesthetically constrained opti-
mization framework for pre-trained GANs. Our LayoutDM
solves such constrained generation tasks on top of the task-
agnostic iterative prediction via logit adjustment.
2.2. Discrete Diffusion Models
Diffusion models [38] are generative models character-
ized by a forward and reverse Markov process. The for-
ward process corrupts the data into a sequence of increas-
ingly noisy variables. The reverse process gradually de-
noises the variables toward the actual data distribution. Dif-
fusion models are stable to train and achieve faster sampling
than autoregressive models by parallel iterative refinement.
Recently, many approaches have learned the reverse pro-
cess by a neural network and show strong empirical perfor-
mance [7,13,39] in continuous state spaces, such as images.
Discrete state spaces are a natural representation of dis-
crete variables, such as text. D3PM [3] extends the pioneer-
ing work of Hoogeboom et al. [14] to structured categor-
ical corruption processes for diffusion models in discrete
state spaces, while maintaining the advantages of diffusion
models for continuous state spaces. VQDiffusion [9] devel-
ops a corruption approach called mask-and-replace, so as
to avoid accumulated prediction errors that are common in
models based on iterative prediction. Following the corrup-
tion model of VQDiffusion, we carefully design a modality-
wise corruption process for layout tasks that involve tokens
from disjoint sets of vocabulary per modality.
Several studies consider a conditional input to the infer-
ence process of diffusion models. Some approaches alter
the reverse diffusion iteration to carefully inject given con-
ditions for free-form image inpainting [28] or image edit-
ing by strokes or composition [30]. We extend the discrete
state-space diffusion models via hard masking or logit ad-
justment to support the conditional generation of layouts.
3. LayoutDM
Our LayoutDM builds on discrete-state space diffusion
models [3,9]. We first briefly review the fundamentals of
discrete diffusion models in Sec. 3.1. Sec. 3.2 explains our
approach to layout generation within the diffusion frame-
work while discussing features inherent in layout compared
with text. Sec. 3.3 discusses how we extend denoising steps
to perform various conditional layout generation by impos-
ing conditions in each step of the reverse process.
3.1. Preliminary: Discrete Diffusion Models
Diffusion models [38] are generative models character-
ized by a forward and reverse Markov process. While
many diffusion models are defined on continuous space
with Gaussian corruption, D3PM [3] introduces a general
diffusion framework for categorical variables designed pri-
marily for texts. Let $T \in \mathbb{N}$ be the total number of timesteps of the diffusion model. We first explain the forward diffusion process. For a scalar discrete variable with $K$ categories $z_t \in \{1, 2, \ldots, K\}$ at timestep $t \in \mathbb{N}$, the probabilities that $z_{t-1}$ transits to $z_t$ are defined by a transition matrix $Q_t \in [0,1]^{K \times K}$ with $[Q_t]_{mn} = q(z_t = m \mid z_{t-1} = n)$:
$$q(z_t \mid z_{t-1}) = v(z_t)^\top Q_t\, v(z_{t-1}), \qquad (1)$$
where $v(z_t) \in \{0,1\}^K$ is a column one-hot vector of $z_t$. The categorical distribution over $z_t$ given $z_{t-1}$ is computed by the column vector $Q_t v(z_{t-1}) \in [0,1]^K$. Assuming the Markov property, we can derive $q(z_t \mid z_0) = v(z_t)^\top \overline{Q}_t\, v(z_0)$, where $\overline{Q}_t = Q_t Q_{t-1} \cdots Q_1$, and:
$$q(z_{t-1} \mid z_t, z_0) = \frac{q(z_t \mid z_{t-1}, z_0)\, q(z_{t-1} \mid z_0)}{q(z_t \mid z_0)} = \frac{\bigl(v(z_t)^\top Q_t\, v(z_{t-1})\bigr)\bigl(v(z_{t-1})^\top \overline{Q}_{t-1}\, v(z_0)\bigr)}{v(z_t)^\top \overline{Q}_t\, v(z_0)}. \qquad (2)$$
Note that due to the Markov property, $q(z_t \mid z_{t-1}, z_0) = q(z_t \mid z_{t-1})$. When we consider $N$-dimensional variables $\mathbf{z}_t \in \{1, 2, \ldots, K\}^N$, the corruption is applied to each variable of $\mathbf{z}_t$ independently. In the following, we explain with $N$-dimensional variables $\mathbf{z}_t$.
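As a concrete illustration of Eqs. (1)-(2), a minimal NumPy sketch of a single forward corruption step for one token is shown below. The uniform transition matrix is only one of the variants discussed by D3PM, and all names here are ours, not the released implementation.

```python
import numpy as np

def uniform_transition_matrix(K: int, beta_t: float) -> np.ndarray:
    # Q_t: with probability (1 - beta_t) keep the token, otherwise
    # resample it uniformly over the K categories (one D3PM variant).
    return (1.0 - beta_t) * np.eye(K) + beta_t * np.full((K, K), 1.0 / K)

def corrupt_one_step(z_prev: int, Q_t: np.ndarray, rng: np.random.Generator) -> int:
    # q(z_t | z_{t-1}) is the column of Q_t selected by the one-hot v(z_{t-1}).
    probs = Q_t[:, z_prev]
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
K = 32                                        # e.g., number of position bins
Q = uniform_transition_matrix(K, beta_t=0.1)
z1 = corrupt_one_step(7, Q, rng)              # corrupt token z0 = 7 for one step
```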
In contrast to the forward process, the reverse denoising process considers a conditional distribution of $\mathbf{z}_{t-1}$ given $\mathbf{z}_t$ modeled by a neural network $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \in [0,1]^{N \times K}$, and $\mathbf{z}_{t-1}$ is sampled according to this distribution. Note that the typical implementation is to predict the unnormalized log probabilities $\log p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ by a stack of bidirectional Transformer encoder blocks. D3PM uses a neural network $\tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t)$, combines it with the posterior $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{z}_0)$, and sums over possible $\tilde{\mathbf{z}}_0$ to obtain the following parameterization:
$$p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \propto \sum_{\tilde{\mathbf{z}}_0} q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \tilde{\mathbf{z}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t). \qquad (3)$$
In addition to the commonly used variational lower bound objective $\mathcal{L}_{\mathrm{vb}}$, D3PM introduces an auxiliary denoising objective. The overall objective is as follows:
$$\mathcal{L}_\lambda = \mathcal{L}_{\mathrm{vb}} + \lambda\, \mathbb{E}_{\mathbf{z}_0 \sim q(\mathbf{z}_0),\, \mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{z}_0)} \bigl[ -\log \tilde{p}_\theta(\mathbf{z}_0 \mid \mathbf{z}_t) \bigr], \qquad (4)$$
where $\lambda$ is a hyper-parameter that balances the two loss terms.
Figure 2. Overview of the corruption and denoising processes in LayoutDM. For simplicity, we use a toy layout consisting of two elements and the model generates three elements at maximum.

Although D3PM proposes many variants of $Q_t$, VQDiffusion [9] offers an improved version of $Q_t$ called the mask-and-replace strategy. They introduce an additional special token [MASK] and three probabilities: $\gamma_t$ of replacing the current token with the [MASK] token, $\beta_t$ of replacing the token with another token, and $\alpha_t$ of keeping the token unchanged. The [MASK] token never transitions to other states. The transition matrix $Q_t \in [0,1]^{(K+1) \times (K+1)}$ is defined by:
$$Q_t = \begin{bmatrix} \alpha_t + \beta_t & \beta_t & \cdots & \beta_t & 0 \\ \beta_t & \alpha_t + \beta_t & \cdots & \beta_t & 0 \\ \vdots & \vdots & \ddots & \beta_t & 0 \\ \beta_t & \beta_t & \beta_t & \alpha_t + \beta_t & 0 \\ \gamma_t & \gamma_t & \gamma_t & \gamma_t & 1 \end{bmatrix}. \qquad (5)$$
$(\alpha_t, \beta_t, \gamma_t)$ is carefully designed so that $z_t$ converges to the [MASK] token for sufficiently large $t$. During testing, we start from $\mathbf{z}_T$ filled with [MASK] tokens and iteratively sample a new set of tokens $\mathbf{z}_{t-1}$ from $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$.
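For reference, the mask-and-replace matrix of Eq. (5) can be assembled directly. The sketch below is ours (the [MASK] state is placed at the last index), not the released code; the particular $(\alpha_t, \beta_t, \gamma_t)$ values are illustrative and only need to satisfy the column-sum constraint.

```python
import numpy as np

def mask_and_replace_matrix(K: int, alpha_t: float, beta_t: float, gamma_t: float) -> np.ndarray:
    # States 0..K-1 are ordinary tokens, state K is [MASK].
    # Column n gives q(z_t | z_{t-1} = n); columns must sum to 1,
    # so alpha_t + K * beta_t + gamma_t = 1 is assumed.
    Q = np.full((K + 1, K + 1), beta_t)
    Q[np.arange(K), np.arange(K)] = alpha_t + beta_t   # keep the same token
    Q[K, :K] = gamma_t                                  # jump to [MASK]
    Q[:K, K] = 0.0                                      # [MASK] never leaves
    Q[K, K] = 1.0
    return Q

Q = mask_and_replace_matrix(K=32, alpha_t=0.9, beta_t=0.002, gamma_t=1.0 - 0.9 - 32 * 0.002)
assert np.allclose(Q.sum(axis=0), 1.0)
```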
3.2. Unconditional Layout Generation
A layout $l$ is a set of elements represented by $l = \{(c_1, \mathbf{b}_1), \ldots, (c_E, \mathbf{b}_E)\}$, where $E \in \mathbb{N}$ is the number of elements in the layout. $c_i \in \{1, \ldots, C\}$ is the categorical information of the $i$-th element in the layout, and $\mathbf{b}_i \in [0,1]^4$ is the bounding box of the $i$-th element in normalized coordinates, where the first two values indicate the center location and the last two indicate the width and height. Following previous works [2,11,20] that regard layout generation as generating a sequence of tokens, we quantize each value in $\mathbf{b}_i$ and obtain $[x_i, y_i, w_i, h_i] \in \{1, \ldots, B\}^4$, where $B$ is the number of bins. The layout $l$ is now represented by $l = \{(c_1, x_1, y_1, w_1, h_1), \ldots\}$.

In this work, we corrupt a layout in a modality-wise manner in the forward process, and we denoise the corrupted layout while considering all elements and modalities in the reverse process, as we illustrate in Fig. 2. Similarly to D3PM [3], we parameterize $p_\theta$ by a Transformer encoder [40], which processes an ordered 1D sequence. To process $l$ by $p_\theta$ while avoiding the order dependency issue [20], we randomly shuffle $l$ in an element-wise manner and then flatten it to produce $l_{\mathrm{flat}} = (c_1, x_1, y_1, w_1, h_1, c_2, x_2, \ldots)$.
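To make the token representation concrete, here is a small sketch of how a layout could be shuffled element-wise and flattened into the 1D sequence $l_{\mathrm{flat}}$. The uniform binning and helper names are our illustration; LayoutDM itself uses KMeans centroids for quantization (see below).

```python
import random

def flatten_layout(elements, num_bins=128, seed=0):
    """elements: list of (category, (cx, cy, w, h)) with coordinates in [0, 1].
    Returns the flattened token sequence (c1, x1, y1, w1, h1, c2, ...)."""
    def quantize(v):
        # simple uniform binning for illustration only
        return min(int(v * num_bins), num_bins - 1)

    elements = list(elements)
    random.Random(seed).shuffle(elements)   # element-wise shuffle avoids a fixed order
    tokens = []
    for category, (cx, cy, w, h) in elements:
        tokens += [category, quantize(cx), quantize(cy), quantize(w), quantize(h)]
    return tokens

l_flat = flatten_layout([(3, (0.5, 0.1, 0.9, 0.1)), (1, (0.5, 0.6, 0.8, 0.7))])
```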
Variable length generation Existing diffusion models generate fixed-dimensional data and are not directly applicable to layout generation because the number of elements in each layout varies. To handle this, we introduce a [PAD] token and define a maximum number of elements in the layout $M \in \mathbb{N}$. Each layout becomes fixed-dimensional data composed of $5M$ tokens by appending $5(M - E)$ [PAD] tokens. [PAD] is treated similarly to the ordinary tokens in VQDiffusion, and the dimension of $Q_t$ becomes $(K+2) \times (K+2)$.
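A sketch of this padding step under our naming assumptions (the PAD placeholder id is illustrative; the real model reserves a dedicated vocabulary entry):

```python
PAD = -1  # placeholder id for the [PAD] token in this sketch

def pad_layout(tokens, max_elements=25, attrs_per_element=5):
    # Append 5 * (M - E) [PAD] tokens so every layout has exactly 5 * M tokens.
    target_len = max_elements * attrs_per_element
    assert len(tokens) <= target_len and len(tokens) % attrs_per_element == 0
    return tokens + [PAD] * (target_len - len(tokens))
```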
Modality-wise diffusion Discrete state-space models assume that all the standard tokens are switchable by corruption. However, layout tokens comprise disjoint sets of token groups, one for each attribute of an element. For example, applying the transition rule of Eq. (5) may change a token representing an element's category into a token representing the width. To avoid such invalid switching, we propose to apply disjoint corruption matrices $Q_t^c, Q_t^x, Q_t^y, Q_t^w, Q_t^h$ for tokens representing the different attributes $c, x, y, w, h$, as we show in Fig. 2. The size of each matrix is $(C+2) \times (C+2)$ for $Q_t^c$ and $(B+2) \times (B+2)$ otherwise, where $+2$ accounts for [PAD] and [MASK].
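One way to realize the modality-wise corruption is to keep a separate transition matrix per attribute and route each token through the matrix of its own modality. This sketch, with our own variable names, assumes the flattened ordering $(c, x, y, w, h)$ per element:

```python
import numpy as np

ATTRS = ["c", "x", "y", "w", "h"]

def corrupt_layout(tokens, Q_per_attr, rng):
    # tokens: flattened sequence (c1, x1, y1, w1, h1, c2, ...);
    # Q_per_attr: dict mapping attribute name -> its own transition matrix,
    # so a category token can never turn into a width token, etc.
    out = []
    for i, z in enumerate(tokens):
        Q = Q_per_attr[ATTRS[i % len(ATTRS)]]
        out.append(int(rng.choice(Q.shape[0], p=Q[:, z])))
    return out
```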
Adaptive Quantization The distribution of position and size information in layouts is highly imbalanced; e.g., elements tend to be aligned to the left, center, or right. Applying uniform quantization to those quantities, as in existing layout generation models [2,11,20], results in a loss of information. As a pre-processing step, we propose to apply a classical clustering algorithm, such as KMeans [29], on $x$, $y$, $w$, and $h$ independently to obtain balanced position and size tokens for each dataset. We show in Sec. 4.7 how the quantization strategy affects the resulting quality.
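A sketch of this pre-processing using scikit-learn's KMeans; this is one plausible implementation under our assumptions, not necessarily the released one:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(values, num_bins=128, seed=0):
    # values: 1D array of, e.g., all x-coordinates in the training set.
    km = KMeans(n_clusters=num_bins, random_state=seed, n_init=10)
    km.fit(np.asarray(values, dtype=np.float64).reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())   # sorted bin centroids loc(j)

def quantize(value, centroids):
    return int(np.abs(centroids - value).argmin())  # index of the nearest centroid
```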
Decoupled Positional Encoding Previous works apply standard positional encoding to a flattened sequence of layout tokens $l_{\mathrm{flat}}$ [2,11,20]. We argue that this flattening approach could lose the structure information of the layout and lead to inferior generation performance. In a layout, each token has two types of indices: the $i$-th element and the $j$-th attribute. We empirically find that independently applying positional encoding to those indices improves the final generation performance, which we study in Sec. 4.7.
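A minimal PyTorch sketch of the idea (our illustration): instead of one embedding over the flat position $5i + j$, the element index $i$ and the attribute index $j$ each get their own learned embedding, and the two are summed.

```python
import torch
import torch.nn as nn

class DecoupledPositionalEmbedding(nn.Module):
    def __init__(self, max_elements: int, num_attrs: int, dim: int):
        super().__init__()
        self.element_emb = nn.Embedding(max_elements, dim)  # which element (i)
        self.attr_emb = nn.Embedding(num_attrs, dim)         # which attribute (j): c, x, y, w, h

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        num_attrs = self.attr_emb.num_embeddings
        # element index = pos // 5, attribute index = pos % 5
        return self.element_emb(pos // num_attrs) + self.attr_emb(pos % num_attrs)
```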
3.3. Conditional Generation
We elaborate on solving various conditional layout generation tasks using a pre-trained, frozen LayoutDM. We inject conditional information into both the initial state $\mathbf{z}_T$ and the sampled states $\{\mathbf{z}_t\}_{t=0}^{T-1}$ during inference but do not modify the denoising network $p_\theta$. The actual implementation of the injection differs by the type of condition.
Strong Constraints The most typical condition is partially known layout fields. Let $\mathbf{z}_{\mathrm{known}} \in \mathcal{Z}^N$ contain the known fields and $\mathbf{m} \in \{0,1\}^N$ be a mask vector denoting the known and unknown fields as $1$ and $0$, respectively. In each timestep $t$, we sample $\hat{\mathbf{z}}_{t-1}$ from $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ in Eq. (3) and then inject the condition by $\mathbf{z}_{t-1} = \mathbf{m} \odot \mathbf{z}_{\mathrm{known}} + (\mathbf{1} - \mathbf{m}) \odot \hat{\mathbf{z}}_{t-1}$, where $\mathbf{1}$ denotes an $N$-dimensional all-ones vector and $\odot$ denotes the element-wise product.
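The masking-based injection amounts to one line per reverse step; a sketch with integer token tensors (ours):

```python
import torch

def inject_known_fields(z_hat: torch.Tensor, z_known: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # z_hat: tokens sampled from p_theta(z_{t-1} | z_t), shape (N,)
    # z_known: tokens for the known fields (arbitrary values where unknown)
    # mask: 1 where the field is given by the user, 0 otherwise
    return mask * z_known + (1 - mask) * z_hat
```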
Weak Constraints We may impose a weaker constraint during generation, such as placing an element near the center. We offer a way to impose such constraints in a unified framework without additional training or external neural network models. We propose to adjust the logits to inject weak constraints in log probability space by
$$\log \hat{p}_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) \propto \log p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) + \lambda_\pi \boldsymbol{\pi}, \qquad (6)$$
where $\boldsymbol{\pi} \in \mathbb{R}^{N \times K}$ is a prior term that weights the desired outputs, and $\lambda_\pi \in \mathbb{R}$ is a hyper-parameter. The prior term can be defined either hard-coded (Refinement in Sec. 4.5) or through differentiable loss functions (Relationship in Sec. 4.5). Let $\{\mathcal{L}_i\}_{i=1}^{L}$ be a set of differentiable loss functions given the prediction; the latter prior definition can be written as:
$$\boldsymbol{\pi} = -\nabla_{p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)} \sum_{i=1}^{L} \mathcal{L}_i\bigl(p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)\bigr). \qquad (7)$$
Although the formulation of Eq. (7) resembles steering diffusion models by gradients from external models [7,25], our primary focus is incorporating classical hand-crafted energies for aesthetic principles of layout [32] that do not depend on an external model. In practice, we tune the hyper-parameters for imposing weak constraints, such as $\lambda_\pi$. Note that these hyper-parameters are used only for inference and are easier to tune than the other training hyper-parameters.
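A sketch of the logit adjustment of Eqs. (6)-(7) in PyTorch; the loss functions and weight are placeholders of ours, and each loss is assumed to return a scalar:

```python
import torch

def adjust_logits(logits: torch.Tensor, losses, weight: float = 1.0) -> torch.Tensor:
    # logits: unnormalized log p_theta(z_{t-1} | z_t), shape (N, K)
    probs = logits.softmax(dim=-1).detach().requires_grad_(True)
    total = sum(loss_fn(probs) for loss_fn in losses)   # differentiable constraint violations
    total.backward()
    prior = -probs.grad                                 # pi = -grad of the total loss (Eq. 7)
    return logits + weight * prior                      # Eq. (6); renormalized by a later softmax
```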
4. Experiment
4.1. Datasets
We use two large-scale datasets for comparison, Rico [5]
and PubLayNet [45]. As we mention in Sec. 3.2, an ele-
ment in a layout for each dataset is described by the five
attributes. For preprocessing, we set the maximum number of elements per layout $M$ to 25. If a layout contains more elements, we discard the whole layout.
We provide an overview of each dataset. Rico is a dataset
of user interface designs for mobile applications containing
25 element categories such as text button, toolbar, and icon.
We divide the dataset into 35,851 / 2,109 / 4,218 samples for
train, validation, and test splits. PubLayNet is a dataset of
research papers containing five element categories, such as
table, image, and text. We divide the dataset into 315,757 /
16,619 / 11,142 samples for train, validation, and test splits.
4.2. Evaluation Metrics
We employ two primary metrics: FID and Maximum
IoU (Max.). These metrics take into account both fidelity
and diversity [12], which are two mutually complemen-
tary properties widely used in evaluating generative mod-
els. FID [12] captures the similarity of generated data to
real ones in feature space. We employ an improved feature
extraction model for layouts [19] instead of a conventional
method [21] to compute FID. Maximum IoU [19] measures
the conditional similarity between generated and real lay-
outs. The similarity is measured by computing optimal
matching that maximizes average IoU between generated
and real layouts that have an identical set of categories. For
reference, we show the FID and Maximum IoU computed
between the validation and test data as Real data.
4.3. Tasks and Baselines
We test LayoutDM on six tasks for evaluation.
Unconditional generates layouts without any conditional
input or constraint.
Category → size + position (C → S+P) is a generation task conditioned on the category of each element [20].
Category + size → position (C+S → P) is conditioned on the category and size of each element.
Completion is conditioned on a small number of elements
whose attributes are all known. Given a complete layout,
we randomly sample from 0% to 20% of elements.
Refinement is conditioned on a noisy layout in which
only geometric information is perturbed [36]. Following
RUITE [36], we synthesize the input layout by adding ran-
dom noise to the size and position of each element. We sam-
ple the noise from a normal distribution with a mean of 0 and a variance of 0.01.
Relationship is conditioned on the category of each ele-
ment and some relationship constraints between the ele-
ments [21]. Following CLG-LO [19], we employ the size
and location relationships and randomly sample 10% rela-
tionships between elements for the experiment.
The first four tasks handle basic layout fields. We include
a few task-agnostic models for comparison using existing
controllable layout generation methods or simple adaptation
of generative models in the following:
LayoutTrans is a simple autoregressive model [11] trained on an element-level shuffled layout, following [33]. We set the generation order of the variables to c → w → h → x → y.
MaskGIT* is originally a non-autoregressive model for unconditional fixed-length data generation [4]. We use [PAD] to enable variable-length generation.
BLT is a non-autoregressive model with layout-specific de-
coding strategy [20].
BART is a denoising autoencoder that can solve both
comprehension and generation tasks based on Transformer
encoder-decoder backbone [22]. We randomly generate a
number of [MASK] tokens from a uniform distribution be-
tween one and the sequence length, and perform masking
based on the number.
VQDiffusion* is a diffusion-based model originally for text-to-image generation [9]. We adapt the model for layout using $K = C + 4B + 2$ tokens, including [PAD].
4.4. Implementation Details
We re-implement most of the models since there are few
official implementations publicly available except [11,19,
20]2. We train all the models on the two datasets with three
independent trials and report the average of the results.
LayoutDM follows VQDiffusion for hyper-parameters unless otherwise specified, such as the configuration of $p_\theta$ and the transition matrix parameters, i.e., $\alpha_t$ and $\gamma_t$. We set the loss weight $\lambda = 0.1$ (in Eq. (4)) and the number of diffusion timesteps $T = 100$. For optimization, we use AdamW [27] with a learning rate of $5.0 \times 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.98$.
Many models, including LayoutDM, use Trans-
former [40] encoder backbone. We define a shared
configuration as follows: 4 layers, 8 attention heads, 512
embedding dimensions, 2048 hidden dimensions, and 0.1
dropout rate. For other models with extra modules, we ad-
just the number of hidden dimensions to roughly match the
number of parameters for a fair comparison. We randomly
shuffle elements in the layout to avoid fixed-order gener-
ation during training. We search best hyper-parameters to
obtain the best FID using the validation set.
4.5. Quantitative Evaluation
C → S+P, C+S → P, Completion In these tasks, we inject conditions by masking. We summarize comparisons in Tab. 1. As task-specific models, we include LayoutVAE [16], NDN-none [21], and LayoutGAN++ [19] for C → S+P. We also adapt these models for C+S → P. LayoutDM outperforms other models except LayoutTrans [11] in completion. The significant performance gap between LayoutDM and VQDiffusion* suggests the contribution of our proposals beyond the simple discrete diffusion models discussed in Sec. 3.2. Results in the completion task suggest that a combination of padding and diffusion models is the primary key to the generation quality. We find that FID and Maximum IoU are not highly correlated only in the completion task. We conjecture that Maximum IoU may become unstable when categories are also predicted, unlike the C → S+P and C+S → P tasks where categories are given.
Fig. 3 shows the qualitative results of some models, including LayoutDM. We can see that LayoutDM generates
2Unfortunately, most datasets have no official train-val-test splits, and
previous approaches work on different splits and pre-processing strategies.
Furthermore, models for FID computation also vary. Thus, we cannot di-
rectly compare our results with the reported figure in the literature.
Table 1. Quantitative comparison in conditional generation given partially known fields. Top two results are highlighted in bold and underline, respectively. † indicates the results of BLT trained with [PAD] as an additional vocabulary, since the original model cannot perform unordered completion in practice.

Task                 Category → Size+Position    Category+Size → Position    Completion
Dataset              Rico         PubLayNet      Rico         PubLayNet      Rico         PubLayNet
Model                FID   Max.   FID   Max.     FID   Max.   FID   Max.     FID   Max.   FID   Max.
Task-specific models
LayoutVAE [16]       33.3  0.249  26.0  0.316    30.6  0.283  27.5  0.315    -     -      -     -
NDN-none [21]        28.4  0.158  61.1  0.162    62.8  0.219  69.4  0.222    -     -      -     -
LayoutGAN++ [19]     6.84  0.267  24.0  0.263    6.22  0.348  9.94  0.342    -     -      -     -
Task-agnostic models
LayoutTrans [11]     5.57  0.223  14.1  0.272    3.73  0.323  16.9  0.320    3.71  0.537  8.36  0.451
MaskGIT* [4]         26.1  0.262  17.2  0.319    8.05  0.320  5.86  0.380    33.5  0.533  19.7  0.484
BLT [20]             17.4  0.202  72.1  0.215    4.48  0.340  5.10  0.387    117†  0.471  131†  0.345
BART [22]            3.97  0.253  9.36  0.320    3.18  0.334  5.88  0.375    8.87  0.527  9.58  0.446
VQDiffusion* [9]     4.34  0.252  10.3  0.319    3.21  0.331  7.13  0.374    11.0  0.541  11.1  0.373
LayoutDM             3.55  0.277  7.95  0.310    2.22  0.392  4.25  0.381    9.00  0.576  7.65  0.377
Real data            1.85  0.691  6.25  0.438    1.85  0.691  6.25  0.438    1.85  0.691  6.25  0.438
Figure 3. Comparison in conditional generation given partially known fields. Columns show the condition, LayoutTrans [11], BLT [20], BART [22], LayoutDM, and real layouts on Rico and PubLayNet; rows correspond to the C → S+P, C+S → P, and Completion tasks.
high-quality layouts with fewer layout aesthetics violations,
such as misalignment and overlap, given diverse conditions.
Unconditional Generation Tab. 2 summarizes the results of unconditional generation. Unconditional layout generation methods often assume a fixed order for element generation, e.g., top-to-bottom, rather than a random order, to obtain better generation quality by constraining the prediction. For reference, we additionally report the results of LayoutTrans [11] trained on the fixed element order (LayoutTrans-fixed). Although we design LayoutDM primarily for conditional generation, LayoutDM achieves the best FID under the random element order setting. We conjecture that BLT's poor performance is due to a train-test mask distribution inconsistency caused by their hierarchical masking strategy for training. BLT masks a randomly sampled number of fields from a single semantic group, i.e., category, position, or size. However, decoding starts with all tokens masked at inference time. The alignment metric of Real data stays at 0.109 in Rico. Very small alignment values of LayoutTrans and MaskGIT can be a signal of producing trivial outputs in Rico.
Table 2. Quantitative comparison in unconditional generation. Top two results are highlighted in bold and underline, respectively.

Dataset                   Rico            PubLayNet
Model                     FID    Align.   FID    Align.
LayoutTrans-fixed [11]    6.47   0.133    17.1   0.084
LayoutTrans [11]          7.63   0.068    13.9   0.127
MaskGIT* [4]              52.1   0.015    27.1   0.101
BLT [20]                  88.2   1.030    116    0.153
BART [22]                 11.9   0.090    16.6   0.116
VQDiffusion* [9]          7.46   0.178    15.4   0.193
LayoutDM                  6.65   0.162    13.9   0.195
Real data                 1.85   0.109    6.25   0.0214
Table 3. Quantitative comparison in the refinement task. Top two results are highlighted in bold and underline, respectively.

Dataset                Rico                    PubLayNet
Model                  FID    Max.   Sim       FID    Max.   Sim
Task-specific models
RUITE [36]             3.23   0.421  0.221     6.39   0.415  0.174
Task-agnostic models
Noisy input            134    0.213  0.177     130    0.242  0.147
LayoutDM               2.77   0.370  0.205     6.75   0.352  0.149
  w/o logit adj.       3.55   0.277  0.168     7.95   0.310  0.127
Real data              1.85   0.691  0.260     6.25   0.438  0.216
Figure 4. Qualitative comparison in the refinement task (columns: Input, RUITE [36], LayoutDM, Real; rows: Rico and PubLayNet).
Refinement Our LayoutDM performs this task with a combination of the strong constraints of element categories, i.e., setting $\mathbf{z}_{\mathrm{known}} = \{(c_1, \text{[MASK]}, \ldots, \text{[MASK]}), \ldots\}$, and the weak constraints that geometric outputs should appear near the noisy inputs. As an example of the weak constraint, we describe a constraint that imposes the x-coordinate estimate of the $i$-th element to be close to the noisy continuous observation $\hat{x}_i$.
Figure 5. Quality-violation trade-off in the relationship task (FID vs. violation rate on Rico and PubLayNet, comparing LayoutGAN++ w/ CLG-LO, LayoutDM, LayoutDM w/o logit adjustment, and NDN-partial). Lower scores indicate better performance for both metrics.

Figure 6. Qualitative comparison in the relationship task (columns: Relationship, w/o rel., w/ rel., Real; rows: Rico and PubLayNet).
We denote a sliced vector of the prior term $\boldsymbol{\pi}$ in Eq. (6) that corresponds to the x-coordinate of the $i$-th element as $\boldsymbol{\pi}^i_x \in \mathbb{R}^K$ and define it by:
$$\pi^i_{x_j} = \begin{cases} 1 & \text{if } |\mathrm{loc}(j) - \hat{x}_i| < m \text{ and } j \in \mathcal{X}, \\ 0 & \text{otherwise,} \end{cases} \qquad (8)$$
where $m$ is a hyper-parameter indicating a margin, $\mathcal{X}$ is a set of indices denoting tokens for $x$ in the vocabularies, and $\mathrm{loc}(j)$ is a function that returns the centroid value of the $j$-th token in the vocabularies. We define similar constraints for the other geometric variables and elements.
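A sketch of how the hard-coded prior of Eq. (8) could be built for the x-coordinate of one element; the centroid array and margin value are placeholders of ours (the centroids would come from the KMeans quantizer):

```python
import numpy as np

def x_prior(noisy_x: float, centroids: np.ndarray, margin: float = 0.05) -> np.ndarray:
    # centroids: loc(j) for every x-token j in the vocabulary.
    # Returns a 0/1 prior over those tokens: 1 for x-tokens whose centroid lies
    # within `margin` of the noisy observation, 0 elsewhere (Eq. 8).
    return (np.abs(centroids - noisy_x) < margin).astype(np.float64)
```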
We summarize the performance in Tab. 3. We addition-
ally report DocSim [34] (Sim) to measure the similarity of a
predicted and its corresponding ground truth layout. Impos-
ing noisy geometric fields as a weak prior significantly im-
proves the masking-only model and makes the performance
much closer to RUITE [36], which is a denoising model not
applicable to other layout tasks. We compare some results
in Fig. 4. Both LayoutDM and RUITE successfully recover
complete layouts from non-trivially noisy layouts.
Relationship We use Eq. (7) to incorporate the relational
constraints during the sampling step of LayoutDM. We fol-
low [19] to employ the loss functions penalizing size and
Figure 7. Speed-quality trade-off (time per sample [ms] vs. FID on PubLayNet) of different models for C+S → P, comparing LayoutVAE, NDN-none, LayoutGAN++, LayoutTrans, BART, MaskGIT*, BLT, VQDiffusion*, LayoutDM, and Real Data.
Table 4. Ablation study results on layout-specific modifications in unconditional generation on the Rico [5] dataset.

                               FID    Align.
LayoutDM                       6.65   0.162
  w/o modality-wise diff.      7.32   0.156
  w/o decoupled pos. enc.      6.78   0.227
  w/ uniform-quantization      7.58   0.256
  w/ percentile-quantization   9.79   0.232
Real Data                      1.85   0.109
location relationships between elements that do not match user specifications. We define the loss functions for continuous bounding boxes, so we have to convert the predicted discrete bounding boxes to continuous ones in a differentiable manner. Given the estimated probabilities of the discrete x-coordinate $p(x)$, for example, we compute the continuous x-coordinate $\bar{x}$ by $\bar{x} = \sum_{n \in \mathcal{X}} p(x = n)\, \mathrm{loc}(n)$. A similar conversion applies to the other attributes. Empirically, we find that applying the logit adjustment multiple times (three times in our experiments) at each diffusion step moderately improves performance.
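The differentiable discrete-to-continuous conversion is simply an expectation over bin centroids; a PyTorch sketch (ours):

```python
import torch

def expected_coordinate(prob: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    # prob: predicted probabilities over the B x-coordinate tokens, shape (..., B)
    # centroids: loc(n), the centroid of each token, shape (B,)
    # Returns the differentiable continuous estimate x_bar = sum_n p(x = n) * loc(n).
    return (prob * centroids).sum(dim=-1)
```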
We compare LayoutDM with two task-specific ap-
proaches: NDN-partial [21] and CLG-LO based on Lay-
outGAN++ [19]. We show the results in Fig. 5. We ad-
ditionally report constraint violation error rates [19]. Lay-
outDM can control the strength of the logit adjustment as
in Eq. (6) and produces an FID-violation trade-off curve.
LayoutDM is comparable to NDN-partial in Rico and out-
performs NDN-partial by a large margin in PubLayNet. Al-
though LayoutDM is inferior to CLG-LO in both datasets,
note that the average runtime of CLG-LO is 4.0s, which is
much slower than 0.5s in LayoutDM. We show some results
of LayoutDM in Fig. 6.
4.6. Speed-Quality Trade-off
Runtime is also essential for controllable generation. We show a speed-quality trade-off curve for C+S → P in Fig. 7. The Transformer encoder-only models, such as LayoutDM and BLT, can achieve fast generation at some sacrifice of quality. We employ the fast-sampling strategy used in discrete diffusion models [3] for LayoutDM by $p_\theta(\mathbf{z}_{t-\Delta} \mid \mathbf{z}_t) \propto \sum_{\tilde{\mathbf{z}}_0} q(\mathbf{z}_{t-\Delta}, \mathbf{z}_t \mid \tilde{\mathbf{z}}_0)\, \tilde{p}_\theta(\tilde{\mathbf{z}}_0 \mid \mathbf{z}_t)$, where $\Delta \in \mathbb{N}$ indicates a step size for generation in $T / \Delta$ steps. Despite being a task-agnostic model, LayoutDM achieves the best quality-speed trade-off except for the task-specific LayoutGAN++ [19], which runs under 10 ms.
4.7. Ablation Study
We investigate whether the techniques in Sec. 3.2 improve the performance. First, we evaluate the choice of quantization methods for the geometric fields of elements. Instead of KMeans, we compute centroids for the quantization by:
Uniform: dataset-agnostic quantization, which is popular in previous works [2,11,20]. Following [11], we choose $\{0.0, \frac{1}{B}, \ldots, \frac{B-1}{B}\}$ and $\{\frac{1}{B}, \ldots, \frac{B-1}{B}, 1.0\}$ for the position and size, respectively.
Percentile: we sort the data into equally sized groups and take the average value of each group as the centroids. This is dataset-specific quantization similar to KMeans.
We show the results at the bottom of Tab. 4. We additionally report the Alignment metric (Align.) used in [19], since the choice of quantization affects the alignment between elements. Compared to Uniform and Percentile, KMeans quantization significantly improves both FID and Alignment.
We confirm that our modality-wise diffusion and decoupled positional encoding both moderately improve the performance, as we show in the top half of Tab. 4.
5. Discussion
LayoutDM is based on diffusion models for discrete
state-space. Using continuous state-space as in latent dif-
fusion models [37] would be interesting. Extension of
LayoutDM to handle various layout properties such as
color [17] and image/text [41] is also appealing.
We believe our proposed logit adjustment can incorpo-
rate more attributes. Attribute-conditional LayoutGAN [24]
considers area, aspect ratio, and reading order of elements
for fine-grained control. Since these attributes can be easily
converted to the size and location relationship constraints,
incorporating them with our LayoutDM is not very difficult.
Potential negative impact Our model might be used to
automatically generate the basic structure of fake websites
or mobile applications, which could lead to scams or the
spreading of misinformation.
References
[1] Maneesh Agrawala, Wilmot Li, and Floraine Berthouzoz.
Design principles for visual communication. Communica-
tions of the ACM, 54(4), 2011. 2
[2] Diego Martin Arroyo, Janis Postels, and Federico Tombari.
Variational transformer networks for layout generation. In
CVPR, 2021. 1,2,3,4,8
[3] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tar-
low, and Rianne van den Berg. Structured denoising diffu-
sion models in discrete state-spaces. In NeurIPS, 2021. 1,2,
3,8
[4] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T
Freeman. MaskGIT: Masked generative image transformer.
In CVPR, 2022. 5,6,7
[5] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hib-
schman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ran-
jitha Kumar. Rico: A mobile app dataset for building data-
driven design applications. In UIST, 2017. 1,2,4,8
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. BERT: Pre-training of deep bidirectional trans-
formers for language understanding. In NAACL, 2019. 2
[7] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. In NeurIPS, 2021. 2,4
[8] Tsu-Jui Fu, William Yang Wang, Daniel McDuff, and Yale
Song. DOC2PPT: Automatic presentation slides generation
from scientific documents. In AAAI, 2022. 2
[9] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo
Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec-
tor quantized diffusion model for text-to-image synthesis. In
CVPR, 2022. 1,2,3,5,6,7
[10] Shunan Guo, Zhuochen Jin, Fuling Sun, Jingwen Li, Zhaorui
Li, Yang Shi, and Nan Cao. Vinci: an intelligent graphic
design system for generating advertising posters. In CHI,
2021. 2
[11] Kamal Gupta, Alessandro Achille, Justin Lazarow, Larry
Davis, Vijay Mahadevan, and Abhinav Shrivastava. Layout-
Transformer: Layout generation and completion with self-
attention. In ICCV, 2021. 1,2,3,4,5,6,7,8
[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilib-
rium. In NeurIPS, 2017. 5
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
sion probabilistic models. In NeurIPS, 2020. 1,2
[14] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. In NeurIPS, 2021. 1,2
[15] Zhaoyun Jiang, Shizhao Sun, Jihua Zhu, Jian-Guang Lou,
and Dongmei Zhang. Coarse-to-fine generative modeling for
graphic layouts. In AAAI, 2022. 2
[16] Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Si-
gal, and Greg Mori. LayoutVAE: Stochastic scene layout
generation from a label set. In CVPR, 2019. 2,5,6
[17] Kotaro Kikuchi, Naoto Inoue, Mayu Otani, Edgar Simo-
Serra, and Kota Yamaguchi. Generative colorization of struc-
tured mobile web pages. In WACV, 2023. 8
[18] Kotaro Kikuchi, Mayu Otani, Kota Yamaguchi, and Edgar
Simo-Serra. Modeling visual containment for web page lay-
out optimization. Computer Graphics Forum, 40(7), 2021.
2
[19] Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota
Yamaguchi. Constrained graphic layout generation via latent
optimization. In ACM MM, 2021. 1,2,5,6,7,8
[20] Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan
Hao, Haifeng Gong, and Irfan Essa. BLT: Bidirectional lay-
out transformer for controllable layout generation. In ECCV,
2022. 1,2,3,4,5,6,7,8
[21] Hsin-Ying Lee, Weilong Yang, Lu Jiang, Madison Le, Ir-
fan Essa, Haifeng Gong, and Ming-Hsuan Yang. Neural de-
sign network: Graphic layout generation with constraints. In
ECCV, 2020. 1,2,5,6,8
[22] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvinine-
jad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and
Luke Zettlemoyer. BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and
comprehension. In ACL, 2020. 5,6,7
[23] Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang,
and Tingfa Xu. LayoutGAN: Generating graphic layouts
with wireframe discriminators. In ICLR, 2019. 2
[24] Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu,
Christina Wang, and Tingfa Xu. Attribute-conditioned lay-
out gan for automatic graphic design. IEEE TVCG, 27(10),
2020. 8
[25] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and
Joshua B Tenenbaum. Compositional visual generation with
composable diffusion models. In ECCV, 2022. 4
[26] Simon Lok and Steven Feiner. A survey of automated layout
techniques for information presentations. In SmartGraphics,
2001. 2
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay
regularization. In ICLR, 2019. 5
[28] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher
Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting
using denoising diffusion probabilistic models. In CVPR,
2022. 2
[29] J MacQueen. Classification and analysis of multivariate ob-
servations. In 5th Berkeley Symp. Math. Statist. Probability,
1967. 4
[30] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia-
jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided
image synthesis and editing with stochastic differential equa-
tions. In ICLR, 2022. 2
[31] Paul Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala,
and Vladlen Koltun. Interactive furniture layout using inte-
rior design guidelines. ACM TOG, 30(4), 2011. 2
[32] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. IEEE TVCG, 20(8), 2014. 2,4
[33] Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten
Kreis, Andreas Geiger, and Sanja Fidler. ATISS: Autoregres-
sive transformers for indoor scene synthesis. In NeurIPS,
2021. 2,5
[34] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar
Averbuch-Elor. READ: Recursive autoencoders for docu-
ment layout generation. In CVPRW, 2020. 7
[35] Chunyao Qian, Shizhao Sun, Weiwei Cui, Jian-Guang Lou,
Haidong Zhang, and Dongmei Zhang. Retrieve-then-adapt:
Example-based automatic generation for proportion-related
infographics. IEEE TVCG, 27(2), 2020. 2
[36] Soliha Rahman, Vinoth Pandian Sermuga Pandian, and
Matthias Jarke. RUITE: Refining ui layout aesthetics using
transformer encoder. In 26th International Conference on
Intelligent User Interfaces-Companion, 2021. 2,5,7
[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 8
[38] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, 2015. 2,3
[39] Yang Song and Stefano Ermon. Improved techniques for
training score-based generative models. In NeurIPS, 2020.
2
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In NeurIPS, 2017. 3,
5
[41] Kota Yamaguchi. CanvasVAE: Learning to generate vector
graphics documents. In ICCV, 2021. 2,8
[42] Xuyong Yang, Tao Mei, Ying-Qing Xu, Yong Rui, and
Shipeng Li. Automatic generation of visual-textual presen-
tation layout. TOMM, 12(2), 2016. 2
[43] Lap Fai Yu, Sai Kit Yeung, Chi Keung Tang, Demetri Ter-
zopoulos, Tony F Chan, and Stanley J Osher. Make it home:
automatic optimization of furniture arrangement. ACM TOG,
30(4), 2011. 2
[44] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH
Lau. Content-aware generative modeling of graphic design
layouts. ACM TOG, 38(4), 2019. 2
[45] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Pub-
LayNet: largest dataset ever for document layout analysis. In
ICDAR, 2019. 1,2,4
10176
... There are a few recent works adopting diffusion models or iterative denoising strategies for 3D scene synthesis problems [17,13,46,41]. The closest work to ours is DiffuScene [41], which generates 3D indoor scenes without floor plan constraints using DDPM by first converting semantic labels to one-hot encoding vectors in the continuous domain and then using argmax to retrieve label predictions. ...
... Diffusion Models for Layout Synthesis. Very recently, diffusion models became popular for layout synthesis, including document layout generation [17,13], 3D scene synthesis [41], furniture re-arrangement [46,41], graph-conditioned 3D layout generation [52,24], and text conditioned scene synthesis [7]. LayoutDM [17] applies diffusion models in discrete state space. ...
... Very recently, diffusion models became popular for layout synthesis, including document layout generation [17,13], 3D scene synthesis [41], furniture re-arrangement [46,41], graph-conditioned 3D layout generation [52,24], and text conditioned scene synthesis [7]. LayoutDM [17] applies diffusion models in discrete state space. After discretizing position and size to a fixed number of bins, they use the discrete corruption process by VQ-Diffusion [11] to train the reverse transformer network for predicting category, position, and size for document layout generation. ...
Preprint
Full-text available
Realistic conditional 3D scene synthesis significantly enhances and accelerates the creation of virtual environments, which can also provide extensive training data for computer vision and robotics research among other applications. Diffusion models have shown great performance in related applications, e.g., making precise arrangements of unordered sets. However, these models have not been fully explored in floor-conditioned scene synthesis problems. We present MiDiffusion, a novel mixed discrete-continuous diffusion model architecture, designed to synthesize plausible 3D indoor scenes from given room types, floor plans, and potentially pre-existing objects. We represent a scene layout by a 2D floor plan and a set of objects, each defined by its category, location, size, and orientation. Our approach uniquely implements structured corruption across the mixed discrete semantic and continuous geometric domains, resulting in a better conditioned problem for the reverse denoising step. We evaluate our approach on the 3D-FRONT dataset. Our experimental results demonstrate that MiDiffusion substantially outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis. In addition, our models can handle partial object constraints via a corruption-and-masking strategy without task specific training. We show MiDiffusion maintains clear advantages over existing approaches in scene completion and furniture arrangement experiments.
... City scene generation combines detailed urban planning, including road networks, land use, and building placement, using techniques from rule-based designs [8,10] to procedural tools like CityEngine [2] and Unreal Engine [5], and deep learning [43,79,61]. While diffusion models [29,36] often simplify layouts to basic elements, limiting complexity, Neural Radiance Fields (NeRF) [43,79,20] produce high-quality visuals but are computationally expensive. In contrast, CityGen [23] combines Stable Diffusion [63] with Low-Rank Adaptation (LoRA) [34] on the MatrixCity dataset [41], resulting in more realistic and controllable city scenes through ControlNet [86]. ...
... We compare the performance of layout generation with other models that generate city layouts as semantic masks, including Infinicity [23], CityDreamer [79], and CityGen [23]. We do not include comparisons with other methods that use object bounding boxes to represent city layouts, such as [59,37,36]. These methods typically limit object types to only buildings and roads and impose constraints on object shapes. ...
Preprint
Full-text available
City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model(LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1)CityCraft-OSM dataset including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations. 2) CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.
... With the advance in deep learning, researchers are glad to embrace data-driven methods [2,12,15,[17][18][19] in layout generation. Most of these works focus on adopting the latest generative architecture but overlook the necessary conditional requirements for layout. ...
... LayoutGAN [21] employs the GAN (Generative Adversarial Network) paradigm and designs a differentiable rendering process for connecting the visual and graphic domains. LayoutVAE [17] and CanvasVAE [39] adopt the VAE (Variational Auto-Encoder) paradigm, while more recent works adopt the auto-regressive architecture [2,12,19] or the diffusion architecture [14,15,44]. Despite their achievement on unconditioned layout generation tasks, they are hard to use in real-world scenarios. ...
Preprint
Full-text available
Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic layout generation, leveraging the multi-modal large language model (MLLM) to accommodate diverse design tasks. In contrast, our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing existing datasets' limitations in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated poster), further validating our model's utility in real-life settings. Marking by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. The code and datasets will be publicly available on https://github.com/posterllava/PosterLLaVA.
... Recent studies utilize sequential models like Transformers [31] to generate layout elements as sequences [10,2,16,15]. LayoutDM and PLay [13,8] demonstrate conditional layout generation capabilities using diffusion models. Two recent works [21,20] leverages LLMs for few shot layout generation, but they either require specific selection and ranking process for examples or only use LLMs to parse the input. ...
Preprint
Layout design, such as user interface or graphical layout in general, is fundamentally an iterative revision process. Through revising a design repeatedly, the designer converges on an ideal layout. In this paper, we investigate how revision edits from human designer can benefit a multimodal generative model. To do so, we curate an expert dataset that traces how human designers iteratively edit and improve a layout generation with a prompted language goal. Based on such data, we explore various supervised fine-tuning task setups on top of a Gemini multimodal backbone, a large multimodal model. Our results show that human revision plays a critical role in iterative layout refinement. While being noisy, expert revision edits lead our model to a surprisingly strong design FID score ~10 which is close to human performance (~6). In contrast, self-revisions that fully rely on model's own judgement, lead to an echo chamber that prevents iterative improvement, and sometimes leads to generative degradation. Fortunately, we found that providing human guidance plays at early stage plays a critical role in final generation. In such human-in-the-loop scenario, our work paves the way for iterative design revision based on pre-trained large multimodal models.
... The field of document generation presents unique challenges in seamlessly integrating visual elements such as style, layout, and multimedia with textual content, posing new problems for the vision community. * Work done during internship at Adobe Research Document layout generation [1,7,9,10,14,16,19] has played a crucial role in numerous applications, ranging from automated report creation to dynamic webpage design, significantly impacting how information is perceived and interacted with by users. With large language models (LLMs) [3,27] becoming more and more capable of compositional reasoning of visual concepts [6], it opens further avenues for exploiting autoregressive approaches in the automatic end-to-end generation of both document content and layout structure. ...
Preprint
Full-text available
While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks.
... Diffusion Models for Discrete Data. Several approaches to discrete generation using diffusion models have been developed [4,28,53,17]. For graph generation specifically, [50] utilize a Markov process that progressively edits graphs by adding or removing edges and altering node or edge categories and is trained using a graph transformer network, that reverses this process to predict the original graph structure from its noisy version. ...
Preprint
We present a formulation of flow matching as variational inference, which we refer to as variational flow matching (VFM). Based on this formulation we develop CatFlow, a flow matching method for categorical data. CatFlow is easy to implement, computationally efficient, and achieves strong results on graph generation tasks. In VFM, the objective is to approximate the posterior probability path, which is a distribution over possible end points of a trajectory. We show that VFM admits both the CatFlow objective and the original flow matching objective as special cases. We also relate VFM to score-based models, in which the dynamics are stochastic rather than deterministic, and derive a bound on the model likelihood based on a reweighted VFM objective. We evaluate CatFlow on one abstract graph generation task and two molecular generation tasks. In all cases, CatFlow exceeds or matches performance of the current state-of-the-art models.
... The AIGC technology is widely applied in the task of generating advertising images. In the previous works, creative generation methods in the advertising scene use deep learning to generate some objects/tags [25,38], dense captions [6] or layout information [12,29] on image. CG4CTR [25] use use diffusion model to generate background images while keeping the main product information unchanged in creative generation task for the advertising scene. ...
Preprint
Artificial Intelligence Generated Content(AIGC), known for its superior visual results, represents a promising mitigation method for high-cost advertising applications. Numerous approaches have been developed to manipulate generated content under different conditions. However, a crucial limitation lies in the accurate description of products in advertising applications. Applying previous methods directly may lead to considerable distortion and deformation of advertised products, primarily due to oversimplified content control conditions. Hence, in this work, we propose a patch-enhanced mask encoder approach to ensure accurate product descriptions while preserving diverse backgrounds. Our approach consists of three components Patch Flexible Visibility, Mask Encoder Prompt Adapter and an image Foundation Model. Patch Flexible Visibility is used for generating a more reasonable background image. Mask Encoder Prompt Adapter enables region-controlled fusion. We also conduct an analysis of the structure and operational mechanisms of the Generation Module. Experimental results show our method can achieve the highest visual results and FID scores compared with other methods.
... Recent studies on layout generation use sequential models like Transformers (Vaswani et al., 2017) to output the layout elements as sequences (Gupta et al., 2021;Arroyo et al., 2021;Kong et al., 2022;Kikuchi et al., 2021). LayoutDM and PLay (Inoue et al., 2023;Cheng et al., 2023) show results in conditional layout generation. We choose PLay as our backbone model based on its flexibility to inject various conditions using latent diffusion (Rombach et al., 2022a) and classifier-free guidance (Ho & Salimans, 2022). ...
Preprint
Learning from human feedback has shown success in aligning large, pretrained models with human values. Prior works have mostly focused on learning from high-level labels, such as preferences between pairs of model outputs. On the other hand, many domains could benefit from more involved, detailed feedback, such as revisions, explanations, and reasoning of human users. Our work proposes using nuanced feedback through the form of human revisions for stronger alignment. In this paper, we ask expert designers to fix layouts generated from a generative layout model that is pretrained on a large-scale dataset of mobile screens. Then, we train a reward model based on how human designers revise these generated layouts. With the learned reward model, we optimize our model with reinforcement learning from human feedback (RLHF). Our method, Revision-Aware Reward Models, allows a generative text-to-layout model to produce more modern, designer-aligned layouts, showing the potential for utilizing human revisions and stronger forms of feedback in improving generative models.
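A hedged sketch of the two ingredients described here: a reward model trained so that the designer-revised layout scores above the raw generation (a Bradley-Terry-style pairwise loss), and a simple policy-gradient update that uses the learned reward. Both `reward_model` and `policy.sample_with_logprob` are assumed interfaces, not the paper's code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, generated, revised):
    """Pairwise objective: the human-revised layout should score higher
    than the raw generated layout."""
    return -F.logsigmoid(reward_model(revised) - reward_model(generated)).mean()

def reinforce_step(policy, reward_model, cond):
    """Toy RLHF-style update: sample layouts, score them with the learned reward,
    and push up the log-probability of high-reward samples (no baseline or KL
    regularization, for brevity)."""
    layouts, logp = policy.sample_with_logprob(cond)   # assumed helper
    with torch.no_grad():
        reward = reward_model(layouts)
    return -(reward * logp).mean()
```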
Preprint
Automatic generation of graphic designs has recently received considerable attention. However, the state-of-the-art approaches are complex and rely on proprietary datasets, which creates reproducibility barriers. In this paper, we propose an open framework for automatic graphic design called OpenCOLE, where we build a modified version of the pioneering COLE and train our model exclusively on publicly available datasets. Based on GPT4V evaluations, our model shows promising performance comparable to the original COLE. We release the pipeline and training results to encourage open development.
Preprint
A poster from a long input document can be considered as a one-page, easy-to-read multimodal (text and images) summary presented on a nice template with good design elements. Automatic transformation of a long document into a poster is a rarely studied but challenging task. It involves content summarization of the input document, followed by template generation and harmonization. In this work, we propose a novel deep submodular function which can be trained on ground-truth summaries to extract multimodal content from the document and explicitly ensures good coverage, diversity, and alignment of text and images. Then, we use an LLM-based paraphraser and propose to generate a template with various design aspects conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.
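While the paper learns a deep submodular function, selection under such an objective is typically done with the standard greedy procedure sketched below (our illustration; `score_fn` stands in for the learned coverage/diversity/alignment objective).

```python
def greedy_submodular_select(candidates, score_fn, budget):
    """Standard greedy maximization of a monotone submodular objective:
    repeatedly add the candidate with the largest marginal gain.
    `score_fn(subset)` is assumed to return the (learned) subset score."""
    selected, remaining = [], list(candidates)
    while len(selected) < budget and remaining:
        base = score_fn(selected)
        gains = [score_fn(selected + [c]) - base for c in remaining]
        best = max(range(len(remaining)), key=gains.__getitem__)
        if gains[best] <= 0:          # no candidate improves the objective
            break
        selected.append(remaining.pop(best))
    return selected
```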
Chapter
Creating visual layouts is a critical step in graphic design. Automatic generation of such layouts is essential for scalable and diverse visual designs. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from previous work on transformers in adopting non-autoregressive transformers. In training, BLT learns to predict the masked attributes by attending to surrounding attributes in two directions. During inference, BLT first generates a draft layout from the input and then iteratively refines it into a high-quality layout by masking out low-confident attributes. The masks generated in both training and inference are controlled by a new hierarchical sampling policy. We verify the proposed model on six benchmarks of diverse design tasks. Experimental results demonstrate two benefits compared to the state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, it achieves up to 10x speedup in generating a layout at inference time compared to the layout transformer baseline. Code is released at https://shawnkx.github.io/blt. Keywords: Design, Layout creation, Transformer, Non-autoregressive
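A rough sketch of the mask-predict decoding loop described above, under our own assumptions about the interface (`model(tokens)` returns per-position logits); in practice, user-specified attributes would be protected from re-masking.

```python
import torch

@torch.no_grad()
def iterative_refine(model, tokens, mask_id, rounds=4, remask_ratio=0.3):
    """Non-autoregressive decoding sketch: fill all masked attribute tokens in
    parallel, then repeatedly re-mask the least confident positions and predict
    them again. `tokens` is a 1D LongTensor containing mask_id at unknown slots."""
    for r in range(rounds):
        conf, pred = model(tokens).softmax(-1).max(-1)
        tokens = torch.where(tokens == mask_id, pred, tokens)   # fill masked slots
        if r < rounds - 1:
            k = int(remask_ratio * tokens.numel())
            tokens[conf.argsort()[:k]] = mask_id                # re-mask low-confidence slots
    return tokens
```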
Chapter
Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Keywords: Compositionality, Diffusion models, Energy-based models, Visual generation
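The conjunction of concepts described here amounts to summing the guidance directions of the individual models; a minimal sketch (assuming a shared noise-prediction model with a condition argument) is:

```python
import torch

def composed_eps(model, x_t, t, conds, null_cond, w=1.0):
    """Compose concepts by adding, to the unconditional prediction, the guidance
    direction contributed by each condition (energy-based view of diffusion)."""
    eps_uncond = model(x_t, t, null_cond)
    eps = eps_uncond.clone()
    for c in conds:
        eps = eps + w * (model(x_t, t, c) - eps_uncond)
    return eps
```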
Article
Even though graphic layout generation has attracted growing attention recently, it is still challenging to synthesize realistic and diverse layouts, due to the complicated element relationships and varied element arrangements. In this work, we seek to improve the performance of layout generation by incorporating the concept of regions, which consist of a smaller number of elements and appear like simple layouts, into the generation process. Specifically, we leverage a Variational Autoencoder (VAE) as the overall architecture and decompose the decoding process into two stages. The first stage predicts representations for regions, and the second stage fills in the detailed position for each element within the region based on the predicted region representation. Compared to prior studies that merely abstract the layout into a list of elements and generate all the element positions in one go, our approach has at least two advantages. First, by the two-stage decoding, our approach decouples the complex layout generation task into several simple layout generation tasks, which reduces the problem difficulty. Second, the predicted regions can help the model roughly know what the graphic layout looks like and serve as global context to improve the generation of detailed element positions. Qualitative and quantitative experiments demonstrate that our approach significantly outperforms the existing methods, especially on complex graphic layouts.
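A toy sketch of the two-stage decoder, with illustrative dimensions and interfaces of our own choosing (the actual model handles variable numbers of regions and elements):

```python
import torch
import torch.nn as nn

class TwoStageLayoutDecoder(nn.Module):
    """Stage 1 maps the latent code to region representations; stage 2 predicts
    element boxes conditioned on the region each element belongs to."""
    def __init__(self, z_dim=64, n_regions=4, region_dim=32, elems_per_region=8):
        super().__init__()
        self.n_regions, self.region_dim, self.elems = n_regions, region_dim, elems_per_region
        self.region_head = nn.Linear(z_dim, n_regions * region_dim)      # stage 1
        self.element_head = nn.Linear(region_dim, elems_per_region * 4)  # stage 2: (x, y, w, h)

    def forward(self, z):                                                # z: (batch, z_dim)
        regions = self.region_head(z).view(-1, self.n_regions, self.region_dim)
        boxes = self.element_head(regions).view(-1, self.n_regions, self.elems, 4)
        return regions, boxes.sigmoid()                                  # normalized coordinates
```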
Article
Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, and slide structure and layout prediction to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.
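As a hedged illustration of hierarchical encoding, the sketch below first encodes each sentence and then contextualizes the sentence vectors at the document level before slide/layout decoding; the GRU choice and dimensions are our own placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Toy two-level encoder: a sentence-level encoder produces one vector per
    sentence, and a document-level encoder contextualizes them."""
    def __init__(self, vocab, emb=128, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.sent_enc = nn.GRU(emb, hid, batch_first=True)
        self.doc_enc = nn.GRU(hid, hid, batch_first=True)

    def forward(self, doc_tokens):                  # (batch, n_sents, n_words)
        b, s, w = doc_tokens.shape
        x = self.embed(doc_tokens).view(b * s, w, -1)
        _, h = self.sent_enc(x)                     # (1, b*s, hid): last hidden per sentence
        sent_vecs = h.squeeze(0).view(b, s, -1)
        doc_ctx, _ = self.doc_enc(sent_vecs)        # (b, n_sents, hid)
        return doc_ctx                              # fed to a slide/layout decoder
```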