Variations of an input image by encoding with CLIP and then decoding with a diffusion model. The variations preserve semantic information, such as the presence of a clock in the painting and the overlapping strokes in the logo, as well as stylistic elements, such as the surrealism in the painting and the color gradients in the logo, while varying the non-essential details.

Source publication
Preprint
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding...
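To make the two-stage design concrete, here is a minimal sketch of the data flow from caption to image. The functions clip_text_embed, prior, and decoder are hypothetical stand-ins (in the paper the prior and decoder are trained diffusion models), so this illustrates only the pipeline structure, not the actual method.

```python
import numpy as np

EMBED_DIM = 512  # assumed CLIP embedding size


def clip_text_embed(caption: str) -> np.ndarray:
    """Stand-in for a CLIP text encoder: maps a caption to a unit vector."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    z = rng.standard_normal(EMBED_DIM)
    return z / np.linalg.norm(z)


def prior(z_text: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: samples a CLIP image embedding given a text embedding."""
    z = z_text + 0.1 * np.random.standard_normal(EMBED_DIM)
    return z / np.linalg.norm(z)


def decoder(z_image: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: renders an image from an image embedding."""
    return np.random.default_rng().random((64, 64, 3))  # placeholder "image"


def text_to_image(caption: str) -> np.ndarray:
    z_text = clip_text_embed(caption)
    z_image = prior(z_text)   # stage 1: text caption -> CLIP image embedding
    return decoder(z_image)   # stage 2: image embedding -> image


img = text_to_image("a surrealist painting of a clock")
print(img.shape)
```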

Contexts in source publication

Context 1
... presence of an encoder and its approximate inverse (the decoder) allows for capabilities beyond text-to-image translation. As in GAN inversion [62, 55], encoding and decoding an input image produces semantically similar output images (Figure 3). We can also interpolate between input images by inverting interpolations of their image embeddings (Figure 4). ...
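The interpolation step can be sketched briefly. Since CLIP embeddings are typically normalized, spherical interpolation (slerp) between the two embeddings is the natural choice, assumed here; clip_image_embed and decoder are hypothetical stand-ins for the paper's components.

```python
import numpy as np


def slerp(z0: np.ndarray, z1: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between two embedding vectors."""
    z0n, z1n = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * z0 + t * z1  # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)


# Interpolate between the CLIP embeddings of two input images, then decode
# each intermediate embedding back to an image (stand-in functions assumed):
#   z_a, z_b = clip_image_embed(image_a), clip_image_embed(image_b)
#   frames = [decoder(slerp(z_a, z_b, t)) for t in np.linspace(0, 1, 8)]
```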
Context 2
... an image x, we can produce related images that share the same essential content but vary in other aspects, such as shape and orientation (Figure 3). To do this, we apply the decoder to the bipartite representation (z_i, x_T) using DDIM with η > 0 for sampling. ...
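A minimal sketch of DDIM sampling with η > 0, conditioned on the CLIP image embedding z_i, might look as follows. The step count, noise schedule, and eps_model are illustrative assumptions, not the paper's settings: a real decoder uses a trained U-Net noise predictor, and x_T would come from DDIM-inverting the input image.

```python
import numpy as np

T = 50                              # number of DDIM steps (assumed)
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)


def eps_model(x_t, t, z_i):
    """Hypothetical noise predictor conditioned on the CLIP image embedding z_i."""
    return np.zeros_like(x_t)  # stub: a trained model would predict the added noise


def ddim_sample(x_T, z_i, eta=0.5):
    """DDIM sampling from the bipartite representation (z_i, x_T); eta > 0 injects noise."""
    x_t = x_T
    for t in range(T - 1, 0, -1):
        eps = eps_model(x_t, t, z_i)
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        x0 = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)  # predicted clean image
        sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
        noise = np.random.standard_normal(x_t.shape)
        x_t = (np.sqrt(a_prev) * x0
               + np.sqrt(1 - a_prev - sigma**2) * eps
               + sigma * noise)
    return x_t


x_T = np.random.standard_normal((64, 64, 3))  # latent from DDIM inversion of the input
z_i = np.random.standard_normal(512)          # CLIP image embedding of the input
variation = ddim_sample(x_T, z_i, eta=0.8)
```

With η = 0 the update is deterministic and reproduces the input; raising η injects fresh noise at each step, which is what makes the decoded outputs vary in non-essential details while z_i pins down the essential content.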

Similar publications

Preprint
Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities....
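The semantically aligned shared space described here is typically learned with a symmetric contrastive (InfoNCE) objective over paired embeddings, as popularized by CLIP. Below is a minimal NumPy sketch of such a loss; the batch size, embedding dimension, and temperature are illustrative assumptions, not details from the MCR paper.

```python
import numpy as np


def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired embeddings from two modalities."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) cosine-similarity matrix
    labels = np.arange(len(z_a))        # matching pairs lie on the diagonal

    def ce(l):  # cross-entropy over rows, target = diagonal entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))


# toy batch: 4 paired embeddings from two modalities (assumed dimensions)
z_img = np.random.standard_normal((4, 512))
z_txt = np.random.standard_normal((4, 512))
print(contrastive_loss(z_img, z_txt))
```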