
AI Layer Test 3


Here are some random clips from the third stage of AI post layer testing. It's a long way from working correctly, but you can see a lot of improvement over the first version, with this one being far more stable. The final version will, of course, need to be nearly 100% stable.

These tests are run at 1/8 of final resolution, for the sake of speed, since it's an iterative process requiring many experiments. It's a very complex problem, with many possible solutions (we're testing various combinations of code from hundreds of competing research papers). Right now I'm thinking that using optical flow to "remember" stylization solves from frame to frame might be the key.
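To make that optical-flow idea concrete, here's a minimal sketch using OpenCV's Farneback flow. The `stylize()` function is just a hypothetical stand-in for whatever model ends up doing the per-frame solve, and the blend weights and filename are made up for illustration; the point is only that each new solve gets pulled toward a motion-compensated copy of the previous one, which is what should damp the flicker.

```python
import cv2
import numpy as np

def stylize(frame):
    # Hypothetical stand-in for the per-frame stylization solve.
    return frame

def warp_prev_to_curr(prev_stylized, prev_gray, curr_gray):
    """Warp the previous stylized frame onto the current frame using dense optical flow."""
    # Flow is computed current -> previous so each current pixel knows where to sample from.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    return cv2.remap(prev_stylized, map_x, map_y, cv2.INTER_LINEAR)

cap = cv2.VideoCapture("shot.mp4")
prev_gray, prev_out = None, None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    out = stylize(frame)
    if prev_out is not None:
        # Pull the fresh solve toward the motion-compensated previous solve to reduce flicker.
        warped = warp_prev_to_curr(prev_out, prev_gray, gray)
        out = cv2.addWeighted(out, 0.4, warped, 0.6, 0)
    prev_gray, prev_out = gray, out
cap.release()
```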

Some people ask why I'm doing this layer at all. It's extremely important. Basically, it functions as a consolidating layer that will ultimately auto-correct all compositing errors and asset mismatches across every frame, project-wide. I can buy a stock footage clip, drop it into the background of a UE5 project, and use this to combine them seamlessly with one click. That means that this layer, once complete, will function as a universal adapter, seamlessly joining the output of not only different programs, but different media sources. For example, since this is a 100% refabrication of every pixel, I can use any source photo without copyright infringement. So I can take any frame from Google Maps Street View and use it as a background for an animated scene. Same with any footage. If I need a car to break down on a road in front of Everest, I can build the car and road in UE5, grab any photo of Everest, and use this layer to combine them seamlessly, reconstituted into a completely original work.

In one of the first shots, a crowd of people can be seen walking along a spaceport causeway. Those people come from archviz packs in a different, poorly matched style that doesn't quite fit the look of the spaceport, which is from a different creator using a different style. In the version above with the primitive alpha layer, you can see that everything on the screen now looks like it was drawn by the same artist. It's automated tonal matching. This won't shave hours off production; it will shave years.

Here is the second test, which I never bothered to post. They are all still terrible, but you can start to see where I'm going with all this.


For reference, or for anyone who never saw it, this was the first test from a few months back. You can see that we've improved the coherency a lot since this one.

 
So, where are we right now? The AI seems to be reproducing the image as close to the original as it can. I was thinking that it was trying to do something to the image. Was this a single video you fed it, or a bunch of layers that the AI put together?
 
As the core process is refined, it's likely that we can push stylization much further. Still, there are issues where it absolutely has to stay tethered to the source pixels pretty closely. Think, for example, of one of my signature shots where the lens dilates during a dolly move. If I allow the code enough leeway, it will reinterpret everything on the screen halfway through the dilation and break the illusion.

It's pretty early in the game for this particular part of the pipeline; I basically just started on it, but others in the academic community have been working on it for years. Chaining and adjusting the existing solutions into a working formula is really just a first step. None of this is commercially available, so this is all about chaining Python code from 100 different developers into something that works. Nvidia supplied two puzzle pieces, the University of Rio de Janeiro supplied another, and so on. I got most of my leads from a channel called Two Minute Papers, which I have been watching religiously for years.

So it's one step at a time. Right now I've got it functioning on a very basic level, but I'm having difficulty keeping the formula consistent as resolution increases. I think the core model was trained mostly on 720p images by the original devs. The final solution will likely involve creating an entirely new source model, trained on 4K Blu-ray movie frames.

In the stills you posted, it doesn't look too impressive, but in motion there are clearly scenes where it's effectively transitioning the CG into a semi-cartoon look.
 
I'll show an example of one of those potential chain pieces. Let's say I solve for stylization over movement and get it smooth with five other code layers, but at the end the picture is noisy. If that were the case, I could install this node into the chain to repair noise as it occurs and correct it before the frame is rendered. So in the end this will be a chain of AI nodes, forming a "recipe" for a visually pleasing image that's been reconstituted to a degree that auto-conforms disparate content into a cohesive whole.
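In code terms, the "recipe" could be as simple as an ordered list of frame-in, frame-out functions. Here's a rough sketch; the node names are hypothetical, and the denoise node just uses OpenCV's non-local-means as a placeholder for whatever repair step actually ends up in the chain. Swapping the recipe around is then just reordering the list.

```python
import cv2

def stylize_node(frame):
    # Placeholder for the stylization solve.
    return frame

def tonal_match_node(frame):
    # Placeholder for matching tone/contrast across mismatched sources.
    return frame

def denoise_node(frame):
    # Example repair node: non-local-means denoising before the frame is rendered out.
    return cv2.fastNlMeansDenoisingColored(frame, None, 5, 5, 7, 21)

# The "recipe": an ordered chain of AI nodes applied to every frame.
RECIPE = [stylize_node, tonal_match_node, denoise_node]

def run_chain(frame, nodes=RECIPE):
    for node in nodes:
        frame = node(frame)
    return frame
```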

 
Yeah, pretty damn cool. That will open up a whole new world for indie developers when we can use real actors in CGI locations.
Won't be long now.
Here's what it looks like when indie filmmakers try to do this without an AI layer.


Obviously the AI wouldn't be able to fix Breen's terrible acting, directing, cinematography, etc., but what it would do is make it look like the characters and background came from the same universe.
 
The stills I posted were just for comparison. No judgement. I know what you mean, though. It's like when I show someone something that's in the development stage and they think it's the finished product.

AI nodes. That makes sense to me. Need to solve a problem? Plug in the right node. To me it's the most intuitive way to work.

That Breen video could look 100% better if the guy took some time to learn proper green screen techniques. Unfortunately, too many people think it's just a matter of sampling the green and then clicking a single button. Good keying often takes multiple keyers, animated mattes, and skill. There is no one keyer that does everything. They each have their strong points and weak points.
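For anyone wondering what the "sample the green, click a button" approach actually amounts to, here's a crude single-pass key sketched in OpenCV. It's nothing like a production key built from multiple keyers and animated mattes, which is exactly the point; every threshold here is illustrative and would need per-shot (often per-frame) tuning.

```python
import cv2
import numpy as np

def crude_green_key(fg_bgr, bg_bgr):
    """One-pass HSV chroma key; a real key would combine several mattes."""
    hsv = cv2.cvtColor(fg_bgr, cv2.COLOR_BGR2HSV)
    # Rough "green screen" hue/saturation range -- illustrative only.
    green = cv2.inRange(hsv, (35, 60, 60), (85, 255, 255))
    green = cv2.morphologyEx(green, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    # Foreground alpha: 1 where the subject is, 0 where the screen was.
    alpha = cv2.GaussianBlur(255 - green, (5, 5), 0).astype(np.float32) / 255.0
    alpha = alpha[..., None]
    bg = cv2.resize(bg_bgr, (fg_bgr.shape[1], fg_bgr.shape[0]))
    comp = fg_bgr.astype(np.float32) * alpha + bg.astype(np.float32) * (1.0 - alpha)
    return comp.astype(np.uint8)
```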

I'm intrigued by your AI endeavors. I have no judgement or opinion because I don't really know anything. It seems that one of your main goals is to be able to automatically create seamless renderings, not just from different layers but also from different-quality models.
 

Perfect! You barely even have to do anything to test this out.
Just download the video, chop 15 seconds of it into a clip, and run it through your program. Voila!

I'm thinking you have some problems with faces in the wide shots, but maybe for the close-ups it would be okay?
Unfortunately, this film doesn't have any extreme close-ups.
 
Some of your questions could be answered by just looking at the video "First Steps" at the bottom of the original post. That was the first attempt, and in that case I did allow the AI to riff a lot more. In that one, you can see that it can draw a much more convincing stylization, but the issue is that the stability is way too low to use in production. So moving forward it's a struggle to get the best of both worlds: the creative stylization seen in individual frames of the first video, along with the more controlled and stable approach seen progressing in the later videos.

All in all, it's a really difficult problem, but if I can solve it, the benefits could potentially be immense.

It's also of some significance that cel animation fares way better under compression than live action or CGI. If you check the file size of an animated episode on Netflix at resolution X, it's half the bitrate but still looks like the same quality. This is a big deal for a YouTube-based platform, since they downgrade bitrate quite a bit. Long story short, the stylized frames should suffer less from compression.
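If anyone wants to sanity-check that bitrate comparison on their own files, ffprobe reports it directly. A quick sketch; the filenames are placeholders, and it assumes the ffmpeg tools are installed.

```python
import json
import subprocess

def bitrate_kbps(path):
    """Read the container bitrate with ffprobe (requires ffmpeg tools on PATH)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=bit_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True)
    return int(json.loads(out.stdout)["format"]["bit_rate"]) / 1000

# Hypothetical files: same resolution, one cel-style, one live action.
for clip in ["animated_episode.mp4", "live_action_episode.mp4"]:
    print(clip, round(bitrate_kbps(clip)), "kbps")
```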
 
Faces are a problem in and of themselves, but one that's already been solved in several ways. I just haven't had the time to implement that side of the solve yet. I can double testing time and partially solve it right now by activating a script someone wrote for exactly that. There is, however, a much stronger solution, and I'm planning to test it soon. If I'm willing to run characters and scenes on separate layers, I can do a pretty good solve right now. I'd like to have a one-shot filter chain, but there is also the possibility of just making a node that automatically splits the signal into parallel processing chains and then recombines them under a final polish layer.
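That split/recombine node would look something like this. It's only a sketch: the three chains are hypothetical placeholders, and the character mask is assumed to come from whatever rotoscoping or segmentation step sits earlier in the pipeline.

```python
import numpy as np

def character_chain(frame):
    # Placeholder: face/character-tuned stylization pass.
    return frame

def environment_chain(frame):
    # Placeholder: background/scene stylization pass.
    return frame

def polish_chain(frame):
    # Placeholder: final unifying pass over the recombined image.
    return frame

def split_recombine(frame, character_mask):
    """Run characters and environment through parallel chains, then recombine and polish."""
    alpha = (character_mask.astype(np.float32) / 255.0)[..., None]   # 1 on characters, 0 elsewhere
    chars = character_chain(frame).astype(np.float32)
    env = environment_chain(frame).astype(np.float32)
    merged = (chars * alpha + env * (1.0 - alpha)).astype(np.uint8)
    return polish_chain(merged)
```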

And yeah, I'm going to run some tests on stuff like this Breen clip. Combine this AI layer with AI rotoscoping, and you could probably start producing a hit YouTube series where Batman keeps guest-starring in clips from other shows, etc.
 

Well, whenever you implement the face solution, I'd love to see this combination of CGI and human actors run through the AI layer.
There are enormous ramifications.

I was thinking about it last night while struggling to fall asleep.
 
I don't think anyone has fully grasped the ramifications of this tech yet. If successfully implemented, it would tear down many of the walls that have prevented indie filmmakers from creating commercial-ready products on a civilian budget.

People will still need to be able to write, act, and direct, but there are things that once cost millions of dollars that could be accomplished for 10 bucks once this works. The Simpsons, for example, has been paying around 200 people to tween frames for decades. It's actually a surprisingly large percentage of the South Korean animation market's yearly income.
 
Here's another test from today. I feel like this one is a failure, but learning what doesn't work is almost as important as learning what does. This one has really ugly crawl in the details, even though I managed to eliminate most of the larger crawl. Looks worse IMO. I did get some stronger contrast and outline, but overall, it's terrible.

 

Yeah, this will EVENTUALLY get to the point where you can input a cartoon like The Simpsons and generate a live-action version.
Gonna be insane. Personally, I just want to use CGI environments with real actors, and that should be possible in a year or two at this rate.
 
I'd be really curious to see 10 seconds of this converted, video timestamped.

I suspect the other huge leap forward is making shitty props look amazing.
 
Nate, this is really cool, what you are doing! But those AI sitcom videos... they're terrifying. Like seeing past the cosmos in an H.P. Lovecraft story or an episode of Tales from the Darkside.
 
Lol. That statement makes me think you'll probably find the main plotline of Save Point fairly unsettling. Take almost anything in the world and look at it in a completely new light, and you start getting this feeling that you're not as tethered to reality or sanity as you think.

Some kid in my sixth-grade class apparently had thousands of these things building an ecosystem on his head. Had I been able to see them as clearly as they are in this photo, I probably would have been horrified. The thing is, on a scientific level, this photograph of a monster is more real than my perception, which was that Jimmy scratched his head a lot and there were no horrifying monsters in sight. We believe that we are in this safe world, imagining horrible things for fun, but is it possible that we live in a horrible world, and imagine safety for fun?

 

I'll probably start tearing down some of these videos into frames and testing them. I just have to reinstall AME, which I haven't had occasion to use since moving to the new computer.
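As a side note, the teardown step doesn't strictly need AME; a small OpenCV loop can dump numbered frames just as well. A rough sketch, with placeholder paths and an optional frame limit:

```python
import os
import cv2

def dump_frames(video_path, out_dir, limit=None):
    """Tear a clip down into numbered PNG frames for testing."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok or (limit is not None and i >= limit):
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{i:05d}.png"), frame)
        i += 1
    cap.release()
    return i

# Example: grab the first 360 frames (about 15 seconds at 24 fps) of a test clip.
dump_frames("breen_clip.mp4", "frames_out", limit=360)
```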

Making bad props look great is what I was referring to above when I talked about normalizing high- and low-quality assets.

As far as "they" getting this working in a few years. I'm not so sure. About 3 guys took a whack at it 4 years ago, and a lot of people have done things that are related, but as far as I can tell, I'm the only one working on this particular use case. It's really strange too, because it's insanely valuable. If you go through that 2 minute papers channel for 3 years, you can see that less than a fraction of a percent of total ai vis research is going towards animated frame source refabrication. Probably 10-20 unfunded people in the world working on it right now. It's a very different problem than text to image, and other formulas you've seen. The big issue holding it back is that the groups of people who work on these python scripts and models are all math and science types, and not really understanding the processes of film, I think they don't understand the value of this particular solution.

What's happening here is that they are solving a bunch of other stuff, like AI up-res with detail restoration, and I'm chaining together partial solutions to try to make this work. It's not copy-paste, though I can see where it sounds like that from my descriptions. It's more like buying a rack full of guitar effects and then tweaking a hundred settings on ten units from different manufacturers, with the goal of getting a signature guitar sound. Lol, that's literally how I spent the whole day two days ago.

Anyway, here's another quick test reel. This time I pushed the stylization too high for animation and just let it redraw individual frames. It's not useful to me directly, but you can see how images from very different lighting and source scenarios can start to feel more themed using the AI.

 
As far as "they" getting this working in a few years. I'm not so sure. About 3 guys took a whack at it 4 years ago

DALL-E was introduced to the world in 2021, and you're talking about stuff from 2018-2019?
Frankly, who cares what people were doing before the age of AI?

Shit has changed. It's a whole new world now.

The biggest issue that I can see is the poor rendering of human faces. Once that is cracked, you've got something totally suitable for a lot of use cases, ready to go. It looks awesome; it just needs better faces, and I have to believe that faces are a focus of AI developers.

Edit


Yeah, I was taking that a step further, beyond assets to props and costumes.
Like, you have a guy dress up in a crappy Bigfoot costume, and then the AI converts it into a believable Bigfoot with all of his acting and nuance.

Maybe it seems obvious to you, but a lot of people reading this thread are not going to make that connection with the technology.
This stuff will work with more than VFX; it's going to enhance SFX too. Very exciting time.
 
Trying to think of how this can benefit you in the short term, here is one idea.

If you can redo something like that Caddyshack gopher and make it look better, and then produce something like real actors talking aboard a 12-million-dollar superyacht, those two videos alone, made with this technology, could easily kickstart a YouTube channel. You could talk about how to do three-point lighting and whatever other basics, but two videos like the ones I mentioned above would definitely put you on the map, IMO.
 
Just another random Midjourney video.


If I can make it work eventually, there's no shortage of ideas on how to monetize it. I think you're still thinking too small. This could be used to make animated versions of classic movies, update old and damaged TV and film in a way that upscaling cannot, etc. But the biggest deal would be the democratizing effect.

I understand why you would think that about DALL-E, but you're way off. A 747 is a far more advanced piece of self-mobilizing machinery than a combine, but if you want to harvest crops, a 747 is basically useless. DALL-E got overhyped anyway. I've run literally a thousand experiments on it by now, and those promo-quality images happen maybe 1 in 100 times. For this process, we are limited to using stuff that works 100% of the time. Not only that, it needs to flow frame to frame for thousands of frames. Stuff like DALL-E and Midjourney can't even flow for two frames. To be clear, none of my above examples have any frame flow, otherwise they would look 10x better. Basically, if I can implement this frame flow, it should already be at a usable level, and then it can be improved over time.

There is this piece of code, which seems to have been abandoned, that was porting the frame data through the NVIDIA optical flow CUDA process, and they had started to stabilize the stylization layer.

Hold on, I'll find it.

This is the bad way. I used it for a bit a few years back while working on this same problem. It's called EBSynth; it's free, and it works well. You can work on a single shot for 8 hours, and it will basically look the way you want. What I need is fire-and-forget batch solving under changing lighting conditions.


Here it is. In 2016, some guy started to solve this and then kind of disappeared. This was his code wired to a much earlier transfer AI, which is now outdated. So what I have to do now is implement a version of this repository with a newer codebase for the stylization. I'm using Stable Diffusion right now, because people have done a lot of work to make it modular with many samplers and mod scripts, but I may have to switch engines depending on what works best with the motion data passthrough.
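On the Stable Diffusion side, the per-frame redraw is basically img2img with a fixed seed and a low strength so it stays tethered to the source pixels. Here's a rough sketch using the diffusers library; the model ID, prompt, and settings are illustrative, it assumes a local GPU, and the motion-data passthrough (the flow warp mentioned earlier) would wrap around this step rather than live inside it.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Illustrative model choice; any SD checkpoint that supports img2img would do.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

def stylize_frame(frame_path, prompt, seed=1234, strength=0.35):
    """Redraw one frame while staying close to the source pixels (low strength)."""
    init = Image.open(frame_path).convert("RGB")
    gen = torch.Generator("cuda").manual_seed(seed)  # fixed seed per shot to reduce drift
    out = pipe(prompt=prompt, image=init, strength=strength,
               guidance_scale=7.0, generator=gen).images[0]
    return out

# Example: low-strength redraw of a single extracted frame.
stylize_frame("frames_out/frame_00000.png", "hand-painted animation style").save("styled_00000.png")
```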

You can find the original paper on his methodology here at the Cornell public archives -



You can see in the video that he only wired it up to one of the old-school style transfer methods, which produces unusable results.

I actually think that even further steps are going to be needed, and that this new formula will need to look at batches within a sequence, lock down an approach, and interpolate between staged approaches to frame solves, just like EBSynth, but automated.
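In rough code, that batched keyframe idea might look like this: solve every Nth frame properly, carry each keyframe's solve onto its neighbors with optical flow, and cross-fade between the two nearest keyframe solves. Everything here is illustrative, including the key spacing and the `stylize()` placeholder, and Farneback flow over larger gaps would likely need something stronger in practice.

```python
import cv2
import numpy as np

def stylize(frame):
    # Placeholder for the full keyframe solve (the expensive, high-quality pass).
    return frame

def propagate(keyframe_out, key_gray, curr_gray):
    """Carry a keyframe's solve onto a nearby frame via dense optical flow."""
    flow = cv2.calcOpticalFlowFarneback(curr_gray, key_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    return cv2.remap(keyframe_out, (gx + flow[..., 0]).astype(np.float32),
                     (gy + flow[..., 1]).astype(np.float32), cv2.INTER_LINEAR)

def solve_batch(frames, key_step=12):
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    keys = {i: stylize(frames[i]) for i in range(0, len(frames), key_step)}
    last_key = ((len(frames) - 1) // key_step) * key_step
    out = []
    for i, g in enumerate(grays):
        k0 = (i // key_step) * key_step
        k1 = min(k0 + key_step, last_key)
        a = propagate(keys[k0], grays[k0], g)
        if k1 == k0:
            out.append(a)
            continue
        b = propagate(keys[k1], grays[k1], g)
        t = (i - k0) / (k1 - k0)
        # Cross-fade between the two nearest keyframe solves.
        out.append(cv2.addWeighted(a, 1.0 - t, b, t, 0))
    return out
```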

Anyway, there is nothing on the market that does this. I just don't see why it can't be done using a combination of a few existing approaches.
 