Fun with Fairy Kei uniforms [Midjourney]

Thelsim@sh.itjust.works · 1 year ago

tal · edit-2 1 year ago

That first firearm looks better than I normally get in Stable Diffusion, but it’s still a little odd – like, there’s no trigger, for example, and it’s being held as if there’s a pistol grip, but no pistol grip is coming out the bottom of her hand.

One thing that current SD models I’ve played with don’t do well is firearms. I’ve tried dealing with that by specifying particular models of firearms, but same thing. They tend to produce images that combine parts from unrelated firearms that don’t make a lot of sense together. Sometimes pieces are backwards, sometimes they’re in bizarre places.

Lemme do a few montages to demonstrate:

With just “rifle”, I get something that looks a lot like an M-14:

[Image: montage of results for the prompt “rifle”]

But it’s a mess of multiple magazines, scopes facing backwards, multiple triggers, bipods with missing legs, stocks on both ends, rifles mounted as scopes on other rifles, etc.

I thought that specifying a precise firearm model might avoid the problem. A Remington 870 is an extremely common firearm; there should be a lot of images of it, so hopefully there’s enough training data to do something reasonable with it alone. But it’s still pretty much a mess with “remington 870 shotgun”:

[Image: montage of results for the prompt “remington 870 shotgun”]

I don’t think that the issue is an inadequate training set size, because there’s plenty of variety in the images. I think that the problem is that the generic algorithms that current image generators use don’t handle certain things well, and those happen to be things that humans are particularly sensitive to looking wrong. To the model, certain things look very similar that look very different to us. Fingers and toes are a famous example. In many images, there’s nothing wrong with adding a few more of something. Have a cornfield, and whether there are five or six rows of similar corn doesn’t matter much. But with a human hand, we care a lot about whether there are five or six fingers.

Same thing with firearms. Lots of kind of similar-looking portions of objects, but some of them go together in ways that we just don’t like.

Maybe image generators could incorporate some kind of training on “bad” images, things that are undesirable, and we could flag images with too many fingers as undesirable.

Problem is that right now they can generally assume that images out there are good, and nobody wants to manually create a “bad” training corpus; it’d be a huge amount of work.

Early on, search engines tried figuring out whether their search results were good by asking users. Users generally didn’t care to spend time ranking search engine results, but IIRC Google realized that one could probably infer some information about whether a result was good or not if a user stopped searching for the thing after they found it. Maybe there’s some way to infer similar information from public image-generation services like Midjourney or DALL-E. If so, that could maybe be used to cheaply build a “bad” corpus.
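
To make that idea concrete, here is a rough sketch of what inferring labels from usage might look like. It is purely illustrative: the event names (`upscale`, `download`, `reroll`, `delete`), the `Event` type, and the `min_votes` threshold are all made up, and nothing here reflects how Midjourney or DALL-E actually log or use interactions.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical implicit signals; real services may expose nothing like these.
POSITIVE = {"upscale", "download"}   # user kept working with the image
NEGATIVE = {"reroll", "delete"}      # user discarded it and tried again


@dataclass(frozen=True)
class Event:
    image_id: str
    action: str


def infer_labels(events, min_votes=2):
    """Return {image_id: "good" | "bad"} inferred from implicit signals."""
    score = defaultdict(int)   # positives minus negatives per image
    count = defaultdict(int)   # number of informative events per image
    for e in events:
        if e.action in POSITIVE:
            score[e.image_id] += 1
            count[e.image_id] += 1
        elif e.action in NEGATIVE:
            score[e.image_id] -= 1
            count[e.image_id] += 1
        # Any other action carries no signal and is ignored.

    labels = {}
    for image_id, n in count.items():
        if n < min_votes or score[image_id] == 0:
            continue  # too little evidence, or evidence that cancels out
        labels[image_id] = "good" if score[image_id] > 0 else "bad"
    return labels
```

Images that end up labeled "bad" would be candidates for the cheap “bad” corpus; anything with too few or conflicting signals is left unlabeled rather than guessed at.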

Thelsim@sh.itjust.works · 1 year ago

Oh these were surprisingly accurate yes. But usually I get the same kind of weird results, droopy gun syndrome being one of my favorites :)

I know Midjourney has a rating system on their website and an incentive for you to vote. But I’m not sure what they actually do with that information.

KeenFlame@feddit.nu · 1 year ago

Removed by mod

tal · edit-2 1 year ago

I’d also add that it’s not, I think, just a matter of throwing more training data of good rifles at it so that it learns that rifles never have two stocks facing in opposite directions in real life. I mean, I recall a very beautiful AI-generated image of a slope of a green hill that was merging into an ocean wave. It was very aesthetically pleasing. But…it’s not something that would ever happen in real life, or could make sense. That’s the same as with the reverse stocks on a rifle. Yet we like the hill-wave, but dislike the reverse firearm stocks. It’s not clear to me whether there’s a great set of existing information out there that would let a generative AI distinguish between the two classes of image.

It is one area where human artists do well – they can use their own aesthetic sense to get a feel for what looks attractive, and use that as a baseline. That’s not perfect – what the artist likes, a particular viewer might not like. But it’s a pretty good starting place. A generative AI has to be able to create new images, but without having an easy sense for what combinations might be unattractive.

I think that one of the interesting things with generative AIs is going to be not just finding what they do well – and they do some things astoundingly (to me) well, like imitating an artist’s style or combining wildly disparate images in interesting ways. It’s also going to be figuring out a number of things that we think are easy but that are actually really hard.

I’m not sure whether making a rifle is going to be one of those – maybe there’s a great way to do that. But there are gonna be some things that are gonna be hard for generative models.

At that point, I think that we’re gonna have to figure out new ways of solving some of those problems – like, people hardcoded “fixes” for faces into Stable Diffusion back in the pre-XL era, as faces and especially eyes often looked a bit off. Maybe we need to move to systems that have a 3D representation of the images. Or maybe we introduce software that allows for human interaction, so that a human can assist with decisions in areas that are hard.