Variations of an input image by encoding with CLIP and then decoding with a diffusion model. The variations preserve semantic information, such as the presence of a clock in the painting and the overlapping strokes in the logo, as well as stylistic elements, such as the surrealism in the painting and the color gradients in the logo, while varying the non-essential details.

Source publication
Preprint
Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding...
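To make the two-stage design concrete, here is a minimal sketch of the data flow from caption to image. The functions clip_text_embed, prior, and decoder are hypothetical stand-ins (in the paper the prior and decoder are trained diffusion models), so this illustrates only the pipeline structure, not the actual method.

```python
import numpy as np

EMBED_DIM = 512  # assumed CLIP embedding size


def clip_text_embed(caption: str) -> np.ndarray:
    """Stand-in for a CLIP text encoder: maps a caption to a unit vector."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    z = rng.standard_normal(EMBED_DIM)
    return z / np.linalg.norm(z)


def prior(z_text: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: samples a CLIP image embedding given a text embedding."""
    z = z_text + 0.1 * np.random.standard_normal(EMBED_DIM)
    return z / np.linalg.norm(z)


def decoder(z_image: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: renders an image from an image embedding."""
    return np.random.default_rng().random((64, 64, 3))  # placeholder "image"


def text_to_image(caption: str) -> np.ndarray:
    z_text = clip_text_embed(caption)
    z_image = prior(z_text)   # stage 1: text caption -> CLIP image embedding
    return decoder(z_image)   # stage 2: image embedding -> image


img = text_to_image("a surrealist painting of a clock")
print(img.shape)
```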

Contexts in source publication

Context 1
... presence of an encoder and its approximate inverse (the decoder) allows for capabilities beyond text-to-image translation. As in GAN inversion [62, 55], encoding and decoding an input image produces semantically similar output images (Figure 3). We can also interpolate between input images by inverting interpolations of their image embeddings (Figure 4). ...
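The interpolation step can be sketched briefly. Since CLIP embeddings are typically normalized, spherical interpolation (slerp) between the two embeddings is the natural choice, assumed here; clip_image_embed and decoder are hypothetical stand-ins for the paper's components.

```python
import numpy as np


def slerp(z0: np.ndarray, z1: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation between two embedding vectors."""
    z0n, z1n = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * z0 + t * z1  # nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)


# Interpolate between the CLIP embeddings of two input images, then decode
# each intermediate embedding back to an image (stand-in functions assumed):
#   z_a, z_b = clip_image_embed(image_a), clip_image_embed(image_b)
#   frames = [decoder(slerp(z_a, z_b, t)) for t in np.linspace(0, 1, 8)]
```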
Context 2
... an image x, we can produce related images that share the same essential content but vary in other aspects, such as shape and orientation (Figure 3). To do this, we apply the decoder to the bipartite representation (z_i, x_T) using DDIM with η > 0 for sampling. ...
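A minimal sketch of DDIM sampling with η > 0, conditioned on the CLIP image embedding z_i, might look as follows. The step count, noise schedule, and eps_model are illustrative assumptions, not the paper's settings: a real decoder uses a trained U-Net noise predictor, and x_T would come from DDIM-inverting the input image.

```python
import numpy as np

T = 50                              # number of DDIM steps (assumed)
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)


def eps_model(x_t, t, z_i):
    """Hypothetical noise predictor conditioned on the CLIP image embedding z_i."""
    return np.zeros_like(x_t)  # stub: a trained model would predict the added noise


def ddim_sample(x_T, z_i, eta=0.5):
    """DDIM sampling from the bipartite representation (z_i, x_T); eta > 0 injects noise."""
    x_t = x_T
    for t in range(T - 1, 0, -1):
        eps = eps_model(x_t, t, z_i)
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        x0 = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)  # predicted clean image
        sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
        noise = np.random.standard_normal(x_t.shape)
        x_t = (np.sqrt(a_prev) * x0
               + np.sqrt(1 - a_prev - sigma**2) * eps
               + sigma * noise)
    return x_t


x_T = np.random.standard_normal((64, 64, 3))  # latent from DDIM inversion of the input
z_i = np.random.standard_normal(512)          # CLIP image embedding of the input
variation = ddim_sample(x_T, z_i, eta=0.8)
```

With η = 0 the update is deterministic and reproduces the input; raising η injects fresh noise at each step, which is what makes the decoded outputs vary in non-essential details while z_i pins down the essential content.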

Similar publications

Preprint
Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities....
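The semantically aligned shared space described here is typically learned with a symmetric contrastive (InfoNCE) objective over paired embeddings, as popularized by CLIP. Below is a minimal NumPy sketch of such a loss; the batch size, embedding dimension, and temperature are illustrative assumptions, not details from the MCR paper.

```python
import numpy as np


def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired embeddings from two modalities."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) cosine-similarity matrix
    labels = np.arange(len(z_a))        # matching pairs lie on the diagonal

    def ce(l):  # cross-entropy over rows, target = diagonal entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))


# toy batch: 4 paired embeddings from two modalities (assumed dimensions)
z_img = np.random.standard_normal((4, 512))
z_txt = np.random.standard_normal((4, 512))
print(contrastive_loss(z_img, z_txt))
```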