Just over a year ago, San Francisco-based OpenAI – known around the world for creating the GPT-3 language model – introduced DALL-E, an artificial intelligence system capable of creating images from a text caption. DALL-E (a name inspired by the artist Salvador Dalí and the Pixar character WALL-E) has allowed users to creatively explore the world of artificial intelligence, offering a fascinating glimpse into the future of AI-generated art.
A few weeks ago, OpenAI released an update to this project, simply called DALL-E 2, capable of producing even higher-quality creations and manipulating images in ways never before seen in an AI system. The main innovations of this version are the ability to make selective edits, the ability to create images similar to but distinct from an original, and the technology on which the generative neural network is based.
The assembly line behind DALL-E 2
Technically, DALL-E 2 is not a single neural network but a series of artificial intelligence models doing their job in turn, as described in the research published by the company. First there is a model – called CLIP (Contrastive Language-Image Pre-training) – that maps a text caption into a representation space (a mathematical encoding of the text); then another model, called the prior, matches that text encoding to a visual encoding – an image embedding – that captures the semantic information expressed by the caption. In short, the system uses a method that makes text and image statistically similar, so that the next model can "draw" the most accurate image for the user-provided caption.
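The core idea behind CLIP-style matching can be sketched in a few lines: text and images live in a shared embedding space, and a caption is matched to the image whose embedding is most similar to it (highest cosine similarity). The embeddings below are hand-made toy vectors, not real CLIP outputs, and the file names are purely illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings: in CLIP these come from trained text and
# image encoders; here they are hand-made vectors for illustration.
text_embedding = np.array([0.9, 0.1, 0.2])  # "an astronaut on horseback"
image_embeddings = {
    "astronaut_on_horse.png": np.array([0.85, 0.15, 0.25]),
    "cat_on_sofa.png":        np.array([0.05, 0.90, 0.30]),
}

# Pick the image whose embedding best matches the caption.
best = max(image_embeddings,
           key=lambda name: cosine_similarity(text_embedding,
                                              image_embeddings[name]))
print(best)  # -> astronaut_on_horse.png
```

In the real system the matching runs in the other direction – the prior produces an image embedding *from* the text embedding – but the shared-space similarity shown here is what makes that translation possible.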
This model is called GLIDE, and thanks to CLIP's preparatory work it is able to create an image faithful to the text of the caption, showing the user the desired content. But it does not end there: since the image produced measures only 64 × 64 pixels – too small for normal use – two upsampling models (models able to create high-definition images from lower resolutions) first bring the image to an intermediate resolution of 256 × 256 pixels, then take a further step to the final resolution of 1024 × 1024 pixels. At this point the generated image is ready to be presented to the user. These steps are necessary because generating images larger than 64 × 64 pixels directly with GLIDE would require a significant increase in computing resources.
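A rough way to see why the cascade is needed: the work a diffusion model does at every step scales with the number of pixels, which grows quadratically with resolution. A minimal sketch of the arithmetic behind the 64 → 256 → 1024 cascade:

```python
# Pixel counts at each stage of the upsampling cascade.
stages = [64, 256, 1024]
pixels = {s: s * s for s in stages}

for s in stages:
    print(f"{s}x{s}: {pixels[s]} pixels")

# Direct 1024x1024 generation would touch 256x more pixels per
# denoising step than the 64x64 base model.
ratio = pixels[1024] // pixels[64]
print(ratio)  # -> 256
```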
What differentiates DALL-E 2 from its predecessor is the image-generation technology, which is now based on a diffusion model.
Diffusion models are not brand new, but they are becoming more and more established in the world of generative artificial intelligence. They create images by reversing a Gaussian noising process: visually, it is like watching an image slowly emerge from a cloud of static, similar to the "snow" seen on old televisions.
People started talking about them as early as 2015, with the study "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", but the study that established the method was "Denoising Diffusion Probabilistic Models", which showed how this new technique can be used to create high-quality images – even better than those of GANs (Generative Adversarial Networks), the rival generative networks used today primarily to create digital faces of people who do not exist. This is clearly demonstrated by a dedicated study, "Diffusion Models Beat GANs on Image Synthesis", which presents numerous examples where diffusion models produce better results than GANs.
Diffusion models are inspired by thermodynamics and learn to create images through training in which they observe a diffusion process that gradually destroys a signal by adding noise. This allows the neural network to learn to predict the noise component in a given signal (such as an image) and then remove it by applying the reverse process, in which an image is generated from pure noise.
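The forward "destruction" process described above can be sketched in a few lines. This is an illustrative toy version of the DDPM noising step, where a fraction ᾱ_t of the original signal survives at step t and the rest is Gaussian noise; the schedule values are assumptions for illustration, not the ones used by DALL-E 2:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear noise schedule (toy values)
alpha_bars = np.cumprod(1.0 - betas)  # fraction of signal surviving at step t

x0 = rng.standard_normal((8, 8))      # toy 8x8 "image"

def noise_step(x0, t):
    """DDPM forward process: mix the original image with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

# Early in the process almost all signal survives; by the last step
# it is essentially destroyed, leaving pure noise.
print(float(alpha_bars[0]))    # ~0.9999
print(float(alpha_bars[-1]))   # close to 0
```

During training, the network sees pairs like `(noise_step(x0, t), t)` and learns to predict the noise `eps` that was added; generation then runs this process in reverse, starting from pure noise.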
Examples and possibilities
As with its predecessor, DALL-E 2 was also presented with several interesting examples and features. To create an image, simply write a caption describing what you would like to see and the model will generate some proposals, like this "astronaut on horseback":
Or playful ones like these "statues stuck in rush-hour traffic":
But if you want, you can also use DALL-E 2 to create your own very personal comics. One user asked the system to draw an "Anime style illustration of a blue haired nun waving a katana in the woods":
With DALL-E 2, compared to the previous version, things become even more interesting. The system is able to modify an image provided by the user – or even one created by the AI itself – by adding or changing objects as desired. Take for example this empty room, where the user has highlighted a specific spot (circled in red):
When prompted to place a sofa in the designated area, DALL-E 2 returns the following image:
It goes without saying that the use cases for architects and designers will be countless. But the system can also precisely edit a given image, such as this rabbit photo provided by a user:
The user simply highlighted the area to be modified with the mouse – in this case the rabbit's hind legs – and wrote "frog legs", obtaining this result:
The AI system created an incredible rabbit-frog chimera, modifying the back of the animal so that it is graphically consistent with the original image, using the same shade as the fur and recreating from scratch the part of the carpet previously covered by the rabbit's real legs.
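The editing mechanism described above boils down to inpainting: the user supplies a mask, only the masked pixels are regenerated, and everything outside the mask is preserved. A toy sketch of that final compositing step, where the "generated" content is just a constant stand-in for the model's output:

```python
import numpy as np

# Toy inpainting compositing: keep the original image outside the
# user's mask, use freshly generated content inside it.
image = np.zeros((4, 4))             # original "photo" (all zeros)
mask = np.zeros((4, 4), dtype=bool)
mask[2:, :] = True                   # user highlights the bottom half

generated = np.full((4, 4), 0.5)     # stand-in for model output

edited = np.where(mask, generated, image)
print(edited[0, 0], edited[3, 0])    # -> 0.0 0.5
```

The hard part, of course, is generating content for the masked region that stays graphically consistent with its surroundings – which is exactly what the diffusion model conditions on in the real system.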
The ability to make gradual changes, select the affected areas, and request edits or additions is one of the most important new features of this version of the system, along with the possibility of creating "variations" of provided images. Suppose, for example, we like the following photo:
In our example, however, this photo is copyrighted and we may not use it for our article, illustration or brochure. It is enough, however, to ask DALL-E 2 to create a variant, and here are the different versions:
As you can see, they are all similar to the original image, yet all different. The photographer who took the original shot would probably not recognize it, while the hypothetical author of the article or brochure can use an image that conveys the same style and atmosphere as the one he wanted to use – hopefully without running into copyright issues.
As with the previous version, DALL-E 2 also winks at creatives, designers, and everyone who works with images. Being able to describe an image and have it created in minutes, with the AI's ability to make selective edits, is the dream of every creative with minimal graphic design skills. But it will not only be the less talented who turn to automatic image creation: the system will also prove very useful in speeding up the work of those who know how to design well but have less and less time to do so.
In addition to creating images, photographs, paintings and comics, DALL-E 2 will also be useful to interior designers, to give shape to their customers' wishes, or to the customers themselves, who can independently and easily generate different furnishing ideas for their home. In the future, the same technology may be used by video game developers to create virtual environments and digital objects, and of course by designers in the metaverse for the same reasons. In these cases, of course, the ability – currently lacking in DALL-E – to consistently create 3D and animated structures will be needed; but given Microsoft's large investment in OpenAI ($1 billion) and the Redmond giant's acquisitions of ZeniMax/Bethesda and Activision Blizzard, it is reasonable to think that the significant generative-AI research conducted by OpenAI could one day be put to use in Microsoft's video games.