Conference Paper

A New Creative Generation Pipeline for Click-Through Rate with Stable Diffusion Model

... Towards that, some works [40] used text-to-image (T2I) models to generate background images while keeping the main product information unchanged for the advertising scene. Indirectly, some other works can be applied to advertising image generation. ...
Preprint
Artificial Intelligence Generated Content (AIGC), known for its superior visual results, represents a promising mitigation method for high-cost advertising applications. Numerous approaches have been developed to manipulate generated content under different conditions. However, a crucial limitation lies in the accurate description of products in advertising applications. Applying previous methods directly may lead to considerable distortion and deformation of advertised products, primarily due to oversimplified content control conditions. Hence, in this work, we propose a patch-enhanced mask encoder approach to ensure accurate product descriptions while preserving diverse backgrounds. Our approach consists of three components: Patch Flexible Visibility, Mask Encoder Prompt Adapter, and an image Foundation Model. Patch Flexible Visibility is used for generating a more reasonable background image. Mask Encoder Prompt Adapter enables region-controlled fusion. We also conduct an analysis of the structure and operational mechanisms of the Generation Module. Experimental results show our method achieves the best visual results and FID scores compared with other methods.
Conference Paper
In Taobao, the largest e-commerce platform in China, billions of items are provided and typically displayed with their images. For better user experience and business effectiveness, Click-Through Rate (CTR) prediction in the online advertising system exploits abundant user historical behaviors to identify whether a user is interested in a candidate ad. Enhancing behavior representations with user behavior images helps understand users' visual preferences and greatly improves the accuracy of CTR prediction. We therefore propose to model user preference jointly with user behavior ID features and behavior images. However, training with user behavior images brings tens to hundreds of images into one sample, giving rise to a great challenge in both communication and computation. To handle these challenges, we propose a novel and efficient distributed machine learning paradigm called Advanced Model Server (AMS). In the well-known Parameter Server (PS) framework, each server node handles a separate part of the parameters and updates them independently. AMS goes beyond this and is designed to learn a unified image descriptor model shared by all server nodes, which embeds large images into low-dimensional, high-level features before transmitting them to worker nodes. AMS thus dramatically reduces the communication load and enables the arduous joint training process. Based on AMS, methods for effectively combining images and ID features are carefully studied, and we then propose a Deep Image CTR Model. Our approach achieves significant improvements in both online and offline evaluations and has been deployed in the Taobao display advertising system, serving the main traffic.
Conference Paper
Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories such as faces, album covers, room interiors etc. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.
Article
Photocopy. Supplied by British Library. Thesis (Ph. D.)--King's College, Cambridge, 1989.
Chapter
In e-commerce, users' feedback may vary depending on how the information they encounter is structured. Recently, ranking approaches based on deep learning have successfully provided good content to users. In this line of work, we propose a novel method for selecting the best of multiple images given a topic. For a given product, one commonly selects a representative from several images describing the product in order to sell it with intuitive visual information. In this case, two factors should be considered: (1) how attractive each image is to users and (2) how well each image fits the given product concept (i.e., topic). Although existing ranking approaches may seem to solve this problem, we experimentally observed that they do not handle factor (2) correctly. In this paper, we propose CLIK (Contrastive Learning for topic-dependent Image ranKing), which effectively solves the problem by considering both factors simultaneously. Our model performs two novel training tasks. First, in topic matching, the model learns the semantic relationship between various images and topics based on contrastive learning. Second, in image ranking, the model ranks the given images with respect to a given topic, leveraging the knowledge learned from topic matching via a contrastive loss. Both training tasks are performed simultaneously by integrated modules with shared weights. Our method showed significant improvements in offline evaluation and received more positive user feedback in online A/B testing compared to existing methods.
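The CLIK abstract above centers on contrastive learning between image and topic embeddings. The paper itself is not reproduced in this listing, so the following is only a minimal, hypothetical sketch of an InfoNCE-style topic-matching loss of the kind such a model might use; the function name, shapes, and temperature are assumptions, not CLIK's actual implementation.

```python
import torch
import torch.nn.functional as F

def topic_matching_loss(image_emb, topic_emb, temperature=0.07):
    """InfoNCE-style contrastive loss between paired image and topic embeddings.

    image_emb, topic_emb: (batch, dim) tensors where row i of each forms a positive pair;
    all other rows in the batch act as in-batch negatives. (Hypothetical sketch.)
    """
    image_emb = F.normalize(image_emb, dim=-1)
    topic_emb = F.normalize(topic_emb, dim=-1)
    logits = image_emb @ topic_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match images to topics and topics to images.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```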
Chapter
We present a systematic study of a new task called dichotomous image segmentation (DIS), which aims to segment highly accurate objects from natural images. To this end, we collected the first large-scale DIS dataset, called DIS5K, which contains 5,470 high-resolution (e.g., 2K, 4K or larger) images covering camouflaged, salient, or meticulous objects in various backgrounds. DIS is annotated with extremely fine-grained labels. In addition, we introduce a simple intermediate supervision baseline (IS-Net) using both feature-level and mask-level guidance for DIS model training. IS-Net outperforms various cutting-edge baselines on the proposed DIS5K, making it a general self-learned supervision network that can facilitate future research in DIS. Further, we design a new metric called human correction efforts (HCE), which approximates the number of mouse-clicking operations required to correct the false positives and false negatives. HCE is used to measure the gap between models and real-world applications and thus complements existing metrics. Finally, we conduct the largest-scale benchmark, evaluating 16 representative segmentation models, providing a more insightful discussion of object complexities, and showing several potential applications (e.g., background removal, art design, 3D reconstruction). We hope these efforts will open up promising directions for both academia and industry. Project page: https://xuebinqin.github.io/dis/index.html.
Conference Paper
In this paper, we study the graphic layout generation problem of producing high-quality visual-textual presentation designs for given images. We note that image compositions, which contain not only global semantics but also spatial information, largely affect layout results. Hence, we propose a deep generative model, dubbed composition-aware graphic layout GAN (CGL-GAN), to synthesize layouts based on the global and spatial visual contents of input images. To obtain training images from images that already contain manually designed graphic layout data, previous work suggests masking design elements (e.g., texts and embellishments) as model inputs, which inevitably leaves hints of the ground truth. We study the misalignment between the training inputs (with hint masks) and test inputs (without masks), and design a novel domain alignment module (DAM) to narrow this gap. For training, we built a large-scale layout dataset which consists of 60,548 advertising posters with annotated layout information. To evaluate the generated layouts, we propose three novel metrics according to aesthetic intuitions. Through both quantitative and qualitative evaluations, we demonstrate that the proposed model can synthesize high-quality graphic layouts according to image compositions. The data and code will be available at https://github.com/minzhouGithub/CGL-GAN.
Chapter
The homepage is the first touch point in the customer's journey and is one of the prominent channels of revenue for many e-commerce companies. A user's attention is mostly captured by homepage banner images (also called Ads/Creatives). The set of banners shown and their design influence the customer's interest and play a key role in optimizing the click-through rates of the banners. Presently, massive and repetitive effort is put into manually creating aesthetically pleasing banner images. Because of the large amount of time and effort involved in this process, only a small set of banners is made live at any point. This reduces the number of banners created as well as the degree of personalization that can be achieved. This paper therefore presents a method to generate creatives automatically, on a large scale and in a short duration. The availability of diverse generated banners helps improve personalization, as they can cater to the tastes of a larger audience. The focus of our paper is on generating a wide variety of homepage banners that can serve as input for a user-level personalization engine. The main contributions of this paper are as follows: (1) We introduce and explain the need for large-scale banner generation for e-commerce companies. (2) We present how we utilize existing deep-learning-based detectors to automatically annotate the required objects/tags from the image. (3) We propose a Genetic Algorithm based method to generate an optimal banner layout for the given image content, input components, and other design constraints. (4) Further, to aid the process of picking the right set of banners, we designed a ranking method and evaluated multiple models. All our experiments have been performed on data from Myntra (http://www.myntra.com), one of the top fashion e-commerce players in India.
Conference Paper
Modern online auction-based advertising systems combine item and user features to promote the ad creatives with the most revenue. However, new ad creatives have to be displayed to some initial users before enough click statistics can be collected and utilized in later ad ranking and bidding processes. This leads to a well-known and challenging cold-start problem. In this paper, we argue that the content of a creative intrinsically determines its performance (e.g., CTR, CVR), and we add a pre-ranking stage based on the content. This stage prunes inferior creatives and thus makes online impressions more effective. Since the pre-ranking stage can be executed offline, we can use deep features and leverage their good generalization to navigate the cold-start problem. Specifically, we propose the Pre Evaluation Ad Creation Model (PEAC), a novel method to evaluate creatives even before they are shown in the online ads system. PEAC utilizes only ad information such as verbal and visual content, and requires no user data as features. During online A/B testing, PEAC showed significant improvement in revenue. The method has been implemented and deployed in the large-scale online advertising system at ByteDance. Furthermore, we provide a detailed analysis of what the model learns, which also gives suggestions for ad creative design.
Article
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.
Article
We describe a new training methodology for generative adversarial networks. The key idea is to grow both the generator and discriminator progressively: starting from a low resolution, we add new layers that model increasingly fine details as training progresses. This both speeds the training up and greatly stabilizes it, allowing us to produce images of unprecedented quality, e.g., CelebA images at 1024^2. We also propose a simple way to increase the variation in generated images, and achieve a record inception score of 8.80 in unsupervised CIFAR10. Additionally, we describe several implementation details that are important for discouraging unhealthy competition between the generator and discriminator. Finally, we suggest a new metric for evaluating GAN results, both in terms of image quality and variation. As an additional contribution, we construct a higher-quality version of the CelebA dataset.
Article
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
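For clarity, the minimax game that this abstract describes can be written out explicitly. The formula below is the standard objective from the GAN literature rather than text reproduced from this listing: D is trained to distinguish data from samples of G, G is trained to fool D, and at the optimum D outputs 1/2 everywhere.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```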
Conference Paper
There is broad consensus that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC), we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast: segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
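As a concrete illustration of the contracting/expanding architecture with skip connections described in this abstract, here is a minimal, hypothetical two-level sketch in PyTorch. It is not the authors' code: the class name, channel widths, and use of padded convolutions (so input and output resolutions match, unlike the original's unpadded ones) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic block on both paths.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net sketch: contracting path, bottleneck, expanding path with skips."""

    def __init__(self, in_ch=1, num_classes=2, base=16):
        super().__init__()
        self.enc1 = double_conv(in_ch, base)
        self.enc2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = double_conv(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = double_conv(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                   # context at full resolution
        e2 = self.enc2(self.pool(e1))       # context at 1/2 resolution
        b = self.bottleneck(self.pool(e2))  # coarsest features
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection from e1
        return self.head(d1)                # per-pixel class logits

# Usage example: logits = TinyUNet()(torch.randn(1, 1, 64, 64))  -> shape (1, 2, 64, 64)
```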
Attention is all you need. Advances in neural information processing systems
  • Ashish Vaswani
  • Noam Shazeer
  • Niki Parmar
  • Jakob Uszkoreit
  • Llion Jones
  • Aidan N Gomez
  • Łukasz Kaiser
  • Illia Polosukhin
Neural discrete representation learning. Advances in neural information processing systems
  • Aaron Van Den Oord
  • Oriol Vinyals
Alias-free generative adversarial networks
  • Tero Karras
  • Miika Aittala
  • Samuli Laine
  • Erik Härkönen
  • Janne Hellsten
  • Jaakko Lehtinen
  • Timo Aila
Diffusion models beat GANs on image synthesis
  • Prafulla Dhariwal
  • Alexander Nichol
GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models
  • Alex Nichol
  • Prafulla Dhariwal
  • Aditya Ramesh
  • Pranav Shyam
  • Pamela Mishkin
  • Bob Mcgrew
  • Ilya Sutskever
  • Mark Chen
Projected GANs converge faster
  • Axel Sauer
  • Kashyap Chitta
  • Jens Müller
  • Andreas Geiger
Generating diverse high-fidelity images with VQ-VAE-2
  • Ali Razavi
Denoising diffusion probabilistic models
  • Jonathan Ho
  • Ajay Jain
  • Pieter Abbeel
CogView: Mastering text-to-image generation via transformers
  • Ming Ding
  • Zhuoyi Yang
  • Wenyi Hong
  • Wendi Zheng
  • Chang Zhou
  • Da Yin
  • Junyang Lin
  • Xu Zou
  • Zhou Shao
  • Hongxia Yang
Learning transferable visual models from natural language supervision
  • Alec Radford
  • Jong Wook Kim
  • Chris Hallacy
  • Aditya Ramesh
  • Gabriel Goh
  • Sandhini Agarwal
  • Girish Sastry
  • Amanda Askell
  • Pamela Mishkin
  • Jack Clark
Scaling autoregressive models for content-rich text-to-image generation
  • Jiahui Yu
  • Yuanzhong Xu
  • Jing Yu Koh
  • Thang Luong
  • Gunjan Baid
  • Zirui Wang
  • Vijay Vasudevan
  • Alexander Ku
  • Yinfei Yang
  • Burcu Karagol Ayan
Zero-shot text-to-image generation
  • Aditya Ramesh
  • Mikhail Pavlov
  • Gabriel Goh
  • Scott Gray
  • Chelsea Voss
  • Alec Radford
  • Mark Chen
  • Ilya Sutskever
Training language models to follow instructions with human feedback
  • Long Ouyang
  • Jeffrey Wu
  • Xu Jiang
  • Diogo Almeida
  • Carroll Wainwright
  • Pamela Mishkin
  • Chong Zhang
  • Sandhini Agarwal
  • Katarina Slama
  • Alex Ray
Improving image generation with better captions
  • James Betker
  • Gabriel Goh
Training diffusion models with reinforcement learning
  • Kevin Black
  • Michael Janner
  • Yilun Du
  • Ilya Kostrikov
  • Sergey Levine
Hierarchical text-conditional image generation with CLIP latents
  • Aditya Ramesh
  • Prafulla Dhariwal
  • Alex Nichol
  • Casey Chu
  • Mark Chen
Photorealistic text-to-image diffusion models with deep language understanding
  • Chitwan Saharia
  • William Chan
  • Saurabh Saxena
  • Lala Li
  • Jay Whang
  • Emily L Denton
  • Kamyar Ghasemipour
  • Raphael Gontijo Lopes
  • Burcu Karagol Ayan
  • Tim Salimans
Low-rank adaptation for fast text-to-image diffusion fine-tuning
  • Simo Ryu
Staging e-commerce products for online advertising using retrieval assisted image generation
  • Yueh-Ning Ku
  • Mikhail Kuznetsov
  • Shaunak Mishra
  • Paloma De