
AI Layer Test 3


Here are some random clips from the third stage of AI post-layer testing. It's still a long way from working correctly, but you can see a lot of improvement over the first version; this one is far more stable. The final version will, of course, need to be nearly 100% stable.

These tests are run at 1/8 of final resolution, for the sake of speed, since it's an iterative process requiring many experiments. It's a very complex problem, with many possible solutions (we're testing various combinations of code from hundreds of competing research papers). Right now I'm thinking that using optical flow to "remember" stylization solves from frame to frame might be the key.
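To make that optical-flow idea concrete, here is a minimal sketch, not the actual pipeline: it assumes OpenCV's Farneback dense flow, and stylize() is a hypothetical stand-in for whatever AI pass does the restyling. For each new frame, the previous stylized result is warped along the flow and blended with the fresh stylization, which is one common way to "remember" the previous solve.

```python
# Hedged sketch, not the actual pipeline: carry the previous frame's
# stylization forward with dense optical flow, then blend it with the
# fresh stylization of the current frame. stylize() is hypothetical.
import cv2
import numpy as np

def warp_previous(prev_stylized, prev_gray, curr_gray):
    """Warp the previous stylized frame into the current frame's geometry."""
    # Backward flow (current -> previous) so each current pixel knows where
    # to sample from in the previous stylized frame.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_stylized, map_x, map_y, cv2.INTER_LINEAR)

def stylize_sequence(frames, stylize, blend=0.5):
    """Blend each frame's stylization with the flow-warped previous result."""
    out, prev_out, prev_gray = [], None, None
    for frame in frames:                      # frames: list of BGR images
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        styled = stylize(frame)               # hypothetical AI stylization pass
        if prev_out is not None:
            warped = warp_previous(prev_out, prev_gray, gray)
            styled = cv2.addWeighted(styled, 1.0 - blend, warped, blend, 0)
        out.append(styled)
        prev_out, prev_gray = styled, gray
    return out
```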

Some people ask why I'm doing this layer at all. It's extremely important. Basically, this functions as a consolidating layer that will ultimately auto-correct all compositing errors and asset mismatches across all frames, project-wide. I can buy a stock footage clip, drop it into the background of a UE5 project, and use this to combine them seamlessly with one click. That means that this layer, once complete, will function as a universal adapter, seamlessly joining the output of not only different programs, but different media sources.

For example, since this is a 100% refabrication of every pixel, I can use any source photo without copyright infringement. So I can take any frame from Google Maps Street View and use that as a background for an animated scene. Same with any footage. If I need a car to break down on a road in front of Everest, I can build the car and road in UE5, grab any photo of Everest, and use this layer to combine them seamlessly, reconstituted into a completely original work.

In one of the first shots, a crowd of people can be seen walking along a spaceport causeway. Those people come from archvis packs, in a different, poorly matched style that doesn't quite fit the look of the spaceport, which is from another creator with its own style. In the version above, with the primitive alpha layer, you can see that everything on the screen now looks like it was drawn by the same artist. It's automated tonal matching. This won't shave hours off of production; it will shave years.

Here is the second test, which I never bothered to post. They are all still terrible, but you can start to see where I'm going with all this.


For reference, or for anyone who never saw it, this was the first test, from a few months back. You can see that we've improved the coherency a lot since then.

 
It's not intentionally doing it; it's just a side effect, but a very important one.

So say you had a really sloppy, lazy composite that took a fraction of the time a good one would take. And you could tell a robot, "OK, now draw me this exact picture with a pencil." It would be impossible to draw tiny subpixel compositing errors with a pencil, so in that process all of the errors would simply disappear.

Over time, I'll post better demos of this, at higher res, and it will become clearer. Bottom line: the AI has never been taught what a compositing error is and doesn't know how to draw one. It's only been trained on pictures without errors, so this is basically Auto-Tune for video.
 
I read about that... Personally, I don't see anything wrong with it. I mean, people from past generations loved Popeye and Bugs Bunny.
 
Another test video, at higher resolution, with slightly modified settings. Faces are still badly distorted, and the crawling effect is still way too strong, but at least in this video you can see what's going on better. Long way to go I think, but I'm getting there.

 
Have you tried mixing video like this and adding something that isn't there?
Like if you took video of a street, had the program convert it, and then added "on fire" or "weird alien trees" instead of normal trees.
 
No, I can't do that. Well, I can, but it wouldn't work out.

The issue is that in this formula, I'm providing the commands via image only. The text prompt just says "painting brushstrokes" or "pencil drawing," and all the rest of the data comes from the input frame and the settings.

The reason is that I need this to be able to do large-scale batch automation, like dropping in 10,000 frames and getting usable output. Since the scenes change rapidly, any instruction that worked in one scene would immediately ruin the scene after it. Also, I have to keep the AI tightly leashed to the current pixel locations, and that constraint is what creates the noisy image drift seen in these tests.

It's kind of a constant tug of war between effective stylization and staying on target, so any deviation in that core prompt would cause the instabilities to spin out of control.
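As a rough illustration of that setup, here is a hedged sketch of a fixed-prompt, image-driven batch pass. The thread never names the model or toolchain, so this just assumes a Hugging Face diffusers img2img pipeline; the model ID, directory paths, and the low strength value standing in for the "tight leash" are all placeholders, not the actual settings.

```python
# Hedged sketch only -- the actual model/toolchain isn't named in the thread.
# Assumes a diffusers img2img pipeline: the text prompt is a fixed style
# description, and everything else is driven by the input frame. A low
# `strength` keeps the output close to the source pixels (the "tight leash").
from pathlib import Path
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",       # placeholder model ID
    torch_dtype=torch.float16,
).to("cuda")

STYLE_PROMPT = "painting brushstrokes"      # fixed for every frame in the batch
LEASH = 0.3                                 # placeholder value, not the real setting

in_dir, out_dir = Path("frames_in"), Path("frames_out")
out_dir.mkdir(exist_ok=True)

for frame_path in sorted(in_dir.glob("*.png")):   # e.g. thousands of rendered frames
    frame = Image.open(frame_path).convert("RGB")
    styled = pipe(prompt=STYLE_PROMPT, image=frame,
                  strength=LEASH, guidance_scale=7.0).images[0]
    styled.save(out_dir / frame_path.name)
```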

But to answer more directly, yes, I tried it. It can look cool, but only for a still frame really.
 
In a scene like that, did you have to define the make and model of the cars? How about the colors? There is a yellow car that stands out. Did you tell it to include a yellow car? How about the people, did you have to define what they were doing?
 
Well, yes to all, but not in the way you're thinking. This whole thing is about communicating with the AI through images. Text prompts are only used for describing the target style. So I 3D modeled the entire city, placed and painted each car, tree, person, etc. I assigned animations to all the characters one by one.

This part of the pipeline really only does stylization, with the purpose being to consolidate asset styles into a uniform presentation.

One ring to rule them all. Input 10 high-quality assets and 10 low-quality assets, output 20 medium-high-quality assets, perfectly matched in the final aesthetic. It's a normalizer, if you're familiar with audio engineering.
 
Yeah, exactly. The big idea is to lower the bar for modularity between different artists and assets. The classic muscle cars were made by one modeler at one realism level, the stop lights by another modeler at a different realism level. With this AI layer, I'm losing detail, but every item in the world appears to be drawn by the same artist, because, post-layer, they were.
 
No, I can't do that. Well, I can, but it wouldn't work out.

I think it would work if you had a paranoid character that saw flashes of fire or scary stuff or whatever; if it's only one frame, you could do some crazy flashes.

It's a mediocre idea though - here is a different idea that may be gold - composite in a REAL actor, then apply the stylization AI layer, and then BOOM, they're all mixed together, right? I suspect this could be a perfectly seamless way to blend CGI and real actors into a uniform presentation.
 
Nate, can you post a clip that shows the original footage, then the AI-processed footage? I'm kind of curious just how much influence the AI has over the look. I imagine it's much more than, say, a LUT.
 
I think it would work if you had a paranoid character that saw flashes of fire or scary stuff or whatever; if it's only one frame, you could do some crazy flashes.

It's a mediocre idea though - here is a different idea that may be gold - composite in a REAL actor, then apply the stylization AI layer, and then BOOM, they're all mixed together, right? I suspect this could be a perfectly seamless way to blend CGI and real actors into a uniform presentation.
I think you're really getting it now. This is exactly the kind of thing that will open up if I can ever fully succeed at this. A lot of testing will have to be done to determine exactly what will work and what won't, but for example:

A greenscreen stock clip of a man eating a pizza could just be inserted into a finished UE5 scene, and then no character design, no animation blueprints, no time spent.

Add a sign to a storefront by just motion tracking a JPEG onto the front sign; the composite disappears in the filter. The text is drawn with the same pencil as the building, so it's seamless.

How about quick insert shots, like a hand tossing a coin? Do an entire sequence in UE5 (because I need to direct cameras and animation details), and then when it comes to that shot, I just use 3 seconds of stock footage, run it through this layer, and it's perfectly matched to all the other CG.

Say there is a scene at the Washington Monument, but it doesn't need any orbiting camera work. No need to build the whole scene; just put the CG actors in front of a stock photo and use the layer to combine them.
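Purely as an illustration of that "composite disappears in the filter" idea, here is a rough sketch of the first half of the workflow: a naive chroma key of a greenscreen actor over a rendered background, handed off to the stylization layer. Nothing here is the author's actual tooling; it assumes OpenCV, the file names are placeholders, and stylize_frame() is a hypothetical stand-in for the AI pass.

```python
# Rough illustration, not the author's pipeline: naive green-screen key over a
# CG background, then pass the composite to the stylization layer, which is
# expected to hide the seam. File names and stylize_frame() are hypothetical.
import cv2
import numpy as np

def chroma_key_composite(actor_bgr, background_bgr):
    """Mask out green-dominant pixels and drop the actor onto the background.

    Assumes both images are BGR and share the same resolution.
    """
    hsv = cv2.cvtColor(actor_bgr, cv2.COLOR_BGR2HSV)
    green_lo = np.array([40, 60, 60])
    green_hi = np.array([85, 255, 255])
    green_mask = cv2.inRange(hsv, green_lo, green_hi)    # 255 where the screen is green
    actor_mask = cv2.bitwise_not(green_mask)             # 255 where the actor is
    fg = cv2.bitwise_and(actor_bgr, actor_bgr, mask=actor_mask)
    bg = cv2.bitwise_and(background_bgr, background_bgr, mask=green_mask)
    return cv2.add(fg, bg)

actor = cv2.imread("actor_greenscreen.png")              # placeholder inputs
background = cv2.imread("ue5_render.png")
composite = chroma_key_composite(actor, background)
# styled = stylize_frame(composite)                      # hypothetical AI layer
cv2.imwrite("composite.png", composite)
```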

One really interesting thing about this that I haven't gone into much is that once the technique is perfected, pencil, paint, and cartoon are just options; the concept could be applied to many stylization ideas, including refabricating the whole input back into photorealism, but seamless and with extra detail.

If this all works eventually, we should be able to mix and match mediums freely, like never before.
 
[Attached images: before.JPG (Before AI) and after.JPG (After AI)]
 
Right, but also keep in mind that these early tests are run at 1/8 resolution, so that's affecting the comparison a lot. The goal right now is to create a smooth animation style that works with many types of scenes; then, as the process becomes more stable, I can risk spending 8x the time per frame on testing.
 
Yeah, pretty damn cool. That will open up a whole new world for indie developers when we can use real actors in CGI locations.
Won't be long now.
 