Image generation is one of the most mature forms of artificial intelligence creation, able to turn a simple idea into a graphic, photograph or image of any kind.
While the underlying technology is fairly mature, there are still strong distinctions between one model and the next, and even in the way one company might deploy the same version of a model compared with another.
In some areas there's a lot of convergence, particularly around hyperrealistic human faces, but in others there are distinct differences, especially in text rendering, skin texture and prompt following.
To get a better idea of how AI might handle fairly complex prompts, I've given the same three requests to seven of the leading AI image generators: DALL-E, Flux, Ideogram, Mystic, Phoenix, Midjourney and Haiper.
Creating the prompts
There are likely more models that I’ve excluded than included, including the incredibly powerful Imagen 3 from Google and Meta's Imagine AI. The reason for their exclusion is that they are not as widely available globally as the ones I included.
The three prompts are fairly distinct: one calls for complex scene creation with elements in specific places, another makes specific requirements for text rendering and a third focuses on skin texture and realism.
If you disagree with any of my decisions or want to try out the prompts with specific settings (I ran them all using defaults), I've included the prompts in full.
Prompt one: The young woman
An ultra-realistic smartphone selfie of a young woman in her mid-20s. The photo has the characteristic sharpness and vivid color of a high-end smartphone camera, with slight motion blur on one edge. The image is taken in natural daylight, causing mild overexposure on one side of her face. She has shoulder-length curly hair with grown-out highlights, and wears minimal, everyday makeup with slightly smudged eyeliner. Her expression is a genuine, slightly lopsided smile with a hint of tiredness around her eyes. She's wearing a comfortable, well-worn graphic t-shirt with a faded band logo. A thin silver necklace is partially tangled in her hair near her collar. The background is a lived-in studio apartment, with a unmade bed and a small bookshelf visible. A houseplant with a few yellowing leaves sits on a windowsill behind her. There's a small coffee stain barely visible on the collar of her shirt.
Midjourney
I used default settings for all of these prompts, which unfortunately does a disservice to Midjourney, the most customizable of all AI image models. Here it missed some of the points of the prompt because its default behavior is to make things perfect. That said, I think it created a brilliant depiction of the woman.
DALL-E
DALL-E is barely in the race when it comes to prompts showing real people, as it makes everyone look a little like a Bratz doll.
Ideogram
Ideogram did a good job of following the 'imperfections' element of the prompt but slightly overdid the motion blur. However, I think this is the most natural of all the images of people.
Freepik Mystic
I like the lighting from Mystic and the woman looks the most realistic. The prompt was followed well but there is a degree of uncanny valley. It also has the 'too perfect' issue of Midjourney.
Flux (using Grok)
Flux might be my favorite overall image. I don't think it's the best in terms of prompt adherence or realistic depiction, but it is good and looks generally more believable.
Leonardo Phoenix
I really did believe this one was a real photo. It captured the imperfections perfectly but the lighting is still slightly off and the framing is weird.
Haiper
Haiper did a good job but it didn't get the lighting right and the skin is too 'perfect'. Otherwise this is my favorite character generated out of the set.
Winner: Ideogram
Prompt two: Penny Lane
A bustling 1960s London street scene on a rainy afternoon. The street is lined with iconic red double-decker buses, black cabs, and people holding colourful umbrellas. A Beatles-inspired band performs on a street corner, with their instruments reflecting in the wet pavement. In the background, Big Ben is visible through a light fog. A neon sign above a small café reads 'Penny Lane' in glowing letters. On the right, a woman in a stylish 1960s dress is waiting for the bus, holding a newspaper with the headline 'Man Walks on Moon.' Raindrops are visibly falling, creating ripples in puddles, and the whole scene has a blend of nostalgia and realism.
Midjourney
Midjourney did a good job of following the scene and 'tried' to render the sign accurately but mixed up the two requests for text.
DALL-E
Again, DALL-E tried to display the text but failed to render it accurately, mixing up the two different statements in strange ways. The scene was also more cartoonish than the others.
Ideogram
Ideogram is the only one that got it pretty much spot on. It rendered 'Penny Lane' and showed a woman holding a newspaper. Its visual isn't as atmospheric as Midjourney's but the scene structure is better.
Freepik Mystic
Mystic, which is based on the Flux model but with some additional fine-tuning, is also impressive. It correctly rendered the text and included a woman with the newspaper. The visual is better than Ideogram's, but the scene structure is not as good because the woman is in the road.
Flux (using Grok)
Flux, generated using Grok, was surprisingly clever: it had the woman with the newspaper and put the words 'Penny Lane' and 'the Beatles' on a billboard along with 'Man Walks on the Moon'. However, while the visual is good, the scene structure is terrible, including two Elizabeth Towers (Big Ben).
Leonardo Phoenix
Leonardo's Phoenix probably had the best prompt adherence of any of the models I tried. It also had impressive text rendering, but its visual look, scene creation and faces were terrible.
Haiper
Haiper had the best visual and atmosphere, but it didn't even attempt the text and ignored many elements of the prompt itself, so it also failed on prompt adherence.
Winner: Ideogram
Prompt three: Victorian London
A bustling Victorian-era London street at twilight, with horse-drawn carriages moving through the cobblestone roads. A well-dressed woman in a crimson dress and bonnet is standing under a gas street lamp, reading a folded newspaper with the headline: 'New Inventions Change the World!'. The glow of the lamp casts a warm light on her face. Steam rises from a nearby street vendor's cart selling roasted chestnuts, while children in tattered clothes run playfully in the background. In the distance, the clock tower of Big Ben looms, half-covered in a misty fog. The realism should highlight the textures of the street, the detailed facial expressions of the woman, and the subtle nuances of the mist and lighting.
Midjourney
Midjourney captured the basics of the scene, including accurately rendering the woman in the bonnet, although it seems to have rendered fog as smoke.
DALL-E
DALL-E didn't attempt the text but it did accurately capture the scene. Again, it was a little more on the cartoon side than the realistic. It looks somewhat like a Victorian postcard.
Ideogram
Ideogram did a reasonable job of rendering the frame. Not a fan of the slight cartoon-ish feel or the kids in the street, but the woman looks natural and it almost gets the text.
Freepik Mystic
Mystic produced the best overall image: it accurately depicted the scene and had a very realistic vibe, but it did fail on the text rendering.
Flux (using Grok)
Flux (in Grok) did the best job of displaying the text on the newspaper and even generated the image in such a way that the woman reading the paper looks more natural.
Leonardo Phoenix
Leonardo Phoenix accurately framed the scene and captured the writing on the newspaper, but the image did have a cartoon-like feel.
Haiper
The scene from Haiper feels a lot more real, as it doesn't attempt to show London by displaying Big Ben. Rather, it seems to show a Victorian-era scene, including the gas lights and a horse and cart on cobbled streets.
Winner: Flux (in Grok)
The winner: Ideogram
When it comes to rendering individuals using AI, it is clear the top models are starting to converge, with very similar-looking characters appearing from the same prompt across different tools. The ability to render text is still variable, with only Ideogram largely consistent.
While I gave the win to Ideogram, the whole thing was largely subjective, and there were enough differences between one model and another that almost any of them could have won any category. The exception was DALL-E, which is feeling its age.