
AI Layer Test 3


Here are some random clips from the third stage of AI post layer testing. It's a long way from working correctly, but you can see a lot of improvement over the first version; this one is far more stable. The final version will, of course, need to be nearly 100% stable.

These tests are run at 1/8 of final resolution, for the sake of speed, since it's an iterative process requiring many experiments. It's a very complex problem, with many possible solutions (we're testing various combinations of code from hundreds of competing research papers). Right now I'm thinking that using optical flow to "remember" stylization solves from frame to frame might be the key.
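To make the optical flow idea concrete, here's a minimal sketch of the kind of thing I mean, using OpenCV. The function name, blend weight, and Farneback parameters are illustrative placeholders, not the actual pipeline code:

```python
import cv2
import numpy as np

def carry_style_forward(prev_gray, curr_gray, prev_stylized, curr_stylized, blend=0.6):
    """Warp the previous frame's stylization along the optical flow and
    blend it with the current solve, so details don't re-randomize."""
    # Backward flow: for each pixel in the current frame, estimate where it
    # came from in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Pull the old stylized pixels to where they belong in this frame.
    warped_prev = cv2.remap(prev_stylized, map_x, map_y, cv2.INTER_LINEAR)
    # Keep most of the remembered style; let the new solve fill in the rest.
    return cv2.addWeighted(warped_prev, blend, curr_stylized, 1.0 - blend, 0)
```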

Some people ask why I'm doing this layer at all. It's extremely important. Basically, it functions as a consolidating layer that will ultimately auto-correct all compositing errors and asset mismatches across every frame, project-wide. I can buy a stock footage clip, drop it into the background of a UE5 project, and use this to combine them seamlessly with one click. That means this layer, once complete, will function as a universal adapter, seamlessly joining the output of not only different programs, but different media sources. For example, since this is a 100% refabrication of every pixel, I can use any source photo without copyright infringement. So I can take any frame from Google Maps Street View and use that as a background for an animated scene. Same with any footage. If I need a car to break down on a road in front of Everest, I can build the car and road in UE5, grab any photo of Everest, and use this layer to combine them seamlessly, reconstituted into a completely original work.

In one of the first shots, a crowd of people can be seen walking along a spaceport causeway. Those people come from archviz packs, in a poorly matched style that doesn't quite fit the look of the spaceport, which is from a different creator using a different style. In the version above with the primitive alpha layer, you can see that everything on screen now looks like it was drawn by the same artist. It's automated tonal matching. This won't shave hours off of production; it will shave years.

Here is the second test, which I never bothered to post. They are all still terrible, but you can start to see where I'm going with all this.


For reference, or for anyone who never saw it, this was the first test, from a few months back. You can see that we've improved the coherency a lot since then.

 
Just curious. Are you waiting for the technology to fully develop, or is this something separate from your main project? (Sorry, I can never remember the name... the one with the cat.)
 
The cat project, "The Labyrinth", is actually the side project, and this is the main project (or rather a component of it). Save Point is the main story, with human characters and a broad scope. The Labyrinth is basically a "training wheels" project to gain experience working in a new way, and to field-test design aspects and production pipelines for use in the master project.

I had been experimenting with what we internally call "the McJarkanizer filter" in one way or another since about 2011. The first idea for hybrid AI choice-based film came around 2005, and was initially called "Cartoon Shaman". We tried again in 2011 and made a brief story that was... not great. At that point I was just using stuff like roto and outline filters, subtractive masking, etc.

Save Point actually launched about the time I realized that these technologies would be ready in a few years, or at least good enough that I could Frankenstein a solution for the master style layer. I knew that there was an enormous amount of setup work to be done before I'd even be ready to use that layer effectively, so that's what the last few years have been, building up the infrastructure and knowledge I would need to be able to communicate to the AI layer effectively.

So basically, I'm not aiming where my target is, I'm aiming at where it's going to be, understanding that it will take years for the bullet to reach the target. About the time I get this layer solved, I should be prepared to supply the input with enough content to validate the layer's purpose.

The idea is to crash budgetary restrictions on marketable animation, and allow high-volume milling that facilitates a new type of story that simply wasn't practical at established rates of speed and cost. If you've ever started mapping out what an exponentially expanding plot looks like, it's something that just isn't possible when you have to spend weeks making 3 minutes of film. So in answer to your question, the knowledge that this AI layer technology was coming farther along in the timeline was the genesis of the current version of the project, rather than just an aspect of it. I knew that everyone would try to jump on board the minute it was possible, and anyone who had been preparing for it for years would have a huge advantage.
 
Pretty cool!

Except for 56-60 when it puts them in blackface for some reason.
wtf ai!
Yeah, I saw that. What's happening there is that it's trying to guess race from skin tone based on pixel luminance, and in darker scenes it's not able to differentiate between a luma value of 72 from darker skin and a 72 from lighter skin in a darker area.

The issue is that it has no memory, so it doesn't realize it painted the same character differently in a previous frame. It's something I have to get fixed at some point in development, but it will probably be fairly easy, since even a single second of memory would get rid of this issue.

"I have assigned shape x attribute x for 15 out of the last 24 frames, should I suddenly reassign it? Goto line 4000, run luminance shift window check, if check positive then return to line and set value negative"

Also the fog in this scene probably gave the AI a lot of trouble. We know what fog is, but to the AI, it's just noise on top of the frame.
 
And here's the next test. I doubled the resolution (this one took 8 hours) and tried to dial it back a bit to get a more cohesive look.

Near the end there's a section where, for a moment, the whole thing is almost working perfectly, but it still has a long way to go.

It's still processing as of this post, but should be viewable about 15 minutes after I post this. I timestamped it to the part where it actually works for a minute.

 
As far as "they" getting this working in a few years goes, I'm not so sure. About three guys took a whack at it four years ago, and a lot of people have done related things, but as far as I can tell, I'm the only one working on this particular use case. It's really strange, too, because it's insanely valuable.

Mmhmmm. [reddit link] of a real person dancing, transformed into animation, with no flickering.
Looks like I was way too conservative with my guess of a few years and they got it in a few months

 
I think the disconnect is that you don't understand the tech well enough. When I talk to people about this, over and over they send me one of about 20 videos of an anime girl dancing in place that someone made. They don't understand that an anime girl is literally the only subject the YouTuber or whoever can render at that quality, and they don't get that if the camera moved around too much the whole thing would melt down into noise; it's just someone showing off a pretty cheap trick. Haven't you ever wondered why there are 100 of these dancing girl videos and not even one actual film yet?

The flickering was never a problem. At least not the luma flickering, which is what's gone in your clip. That's just a single drag-and-drop effect locked behind the DaVinci Resolve paywall, which I paid for last month after I paid off the PSVR2. So that particular issue isn't what I was talking about, but I've also solved the other one I was talking about in the intervening time, which requires another program. Right now my pipeline runs across six programs using Python scripts, and some sections have many stages. The last couple of videos you sent are just SD outputs with some compositing and a single deflicker layer.

The other flickering, the line/pixel displacement over time, I'm currently solving with an EbSynth midstage, which was the plan from way back. I actually have EbSynth tests from around 2019.

Long story short, take these videos with a grain of salt, because last month about 200 DK people announced that they had solved this, and literally not one of them is even close. There are people getting close, though, at huge functioning companies like Microsoft and Google, and those are the people I actually do have to worry about. Also some of the legit research scientists, but they have been screwing around with this since 2014 with remarkably little progress.
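For anyone curious what "runs across six programs using Python scripts" looks like in practice, the glue code is basically a stage runner. The script names, stage order, and flags below are stand-ins (the real pipeline isn't public), but the shape is the same:

```python
import subprocess
from pathlib import Path

# Hypothetical stage list: the script names and flags are placeholders,
# just to show the shape of the glue code that chains the programs together.
STAGES = [
    ["render_export.py", "--in", "{src}", "--out", "{work}/01_plates"],
    ["stylize_pass.py", "--in", "{work}/01_plates", "--out", "{work}/02_styled"],
    ["ebsynth_batch.py", "--keys", "{work}/02_styled", "--out", "{work}/03_propagated"],
    ["deflicker_pass.py", "--in", "{work}/03_propagated", "--out", "{work}/04_clean"],
]

def run_pipeline(src: str, work: str):
    Path(work).mkdir(parents=True, exist_ok=True)
    for stage in STAGES:
        cmd = ["python"] + [arg.format(src=src, work=work) for arg in stage]
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # stop the whole chain if a stage fails

if __name__ == "__main__":
    run_pipeline("shots/scene_012", "work/scene_012")
```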

Here's what you might think, at first glance, is a finished solution, available for free 3 years ago. This is the problem with trying to infer the state of complex technologies from a cursory examination of a short video. From appearances, and even the dialogue of the video, you'd be forgiven for thinking we had this whole thing solved in 2019, right? So why did I spend over 100 hours running tests on this specific solution over the last month? Because it works, but it's so work-intensive and unstable that I could never complete a professional work with it.


Here is the only video from last month that's even close, and it's not even close to a full solution. They had to do a lot of cheating to make it work: stealing frames from anime shows, doing composites and calling it AI because it was 15% AI, that sort of thing. But it is closer than anything else out there to the method I've been building. They spent a week or two with a paid crew making a short, dumb joke film, and moved on. That's the top YouTube competitor in the world right now in this area, and they aren't even half serious about it.


Try to see it from my perspective -

"Hey I'm Sean, and I just spent 2 years writing a novel about a wizard school"

"Duh, too late, Jake Paul just uploaded a 5 second gif of a wizard school that he worked on for an hour yesterday, geez Sean, I thought you were smarter than this, get your own idea."

"It's not really the same thing, you see I'm working on a serious book where I really have to think about the structure of the plot to make sure it's viable beyond some 5 second flash in the pan image, you know, a novel, not a gif, it's different in so many ways and requires so much more to actually work"

"I don't know what all that means but I'm just saying Jake came up with that wizard school thing and his is already done and published, so it's like Jake Paul 1 Sean 0 right?"

"I'm not writing a 300 page GIF though, so you can understand that it's not really a valid comparison right? Even if I was, you get that a GIF that lasted 20 hours would require a lot more infrastructure than one that only needed to work for a few minutes?"

------

Basically, I was past the level seen in this video a long while back. Essentially as soon as ControlNet was published, anyone could do this part by training a custom model on their footage and then assigning it to a narrow LoRA or textual inversion to get a high-fidelity output. That's great when you have a free model library that, by the way, is 99.8 percent waifu drawing styles and literally almost nothing else. Should we start making movies now that this guy has solved it ahead of me? The movies can't have cars in them, because there are no models out for moving cars, so they can't be retargeted like this. How about having a door open in your movie? This person's method doesn't do doors. It's a one-off quick hack that makes this one scene work one time.
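For context, the per-frame ControlNet trick that kind of video relies on looks roughly like this with the diffusers library. The model IDs, LoRA path, and prompt are placeholders, and this is the cheap per-frame approach I'm describing, not my own pipeline:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Edge-conditioned ControlNet keeps each frame's layout; a custom LoRA pins the style.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/custom_style_lora")  # placeholder path

def stylize_frame(frame_path, prompt):
    frame = np.array(Image.open(frame_path).convert("RGB"))
    edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY), 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))
    # Each frame is solved independently, which is exactly why the result
    # shimmers the moment the camera or the subject moves much.
    return pipe(prompt, image=control, num_inference_steps=20).images[0]
```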

What I need for my work is a system that doesn't break down under any circumstance, and can be automated to a waaaaaay larger degree than what you're seeing here, which requires individual work and attention for every single element to function.

Their idea of building a leverage device for scooping:

[image: 1681323310800.png]

My idea for building a leverage device for scooping:

[image: 1681323405978.png]

Their idea of a 4k camera:

[image: 1681323480903.png]

My actual 4k camera, gimbal, and crane:

[image: 1681323708596.png]


Same words, very different reality, if you take my point. I think you'll find a lot of 2-minute Harry Potter spoofs on YouTube, and it should be easy for you to understand the difference between what you are doing and what they are doing.
 