Microsoft’s New AI: Virtual Humans Became Real! 🤯

Dear Fellow Scholars, this is Two Minute Papers
with Dr. Károly Zsolnai-Fehér. Today, we are going to see Microsoft’s AI
looking at a lot of people who don’t exist, and then, we will see that these virtual people
can teach it something about real people. Now, through the power of computer graphics
algorithms, we are able to create virtual worlds, and of course, within those virtual
worlds, virtual humans too.

So, here is a wacky idea. If we have all this virtual data, why not use it instead of real photos to train a new AI to do useful things? Hmm… wait a second. Maybe this idea is not so wacky after all. Especially because we can generate as many of these virtual humans as we wish, and all this data is perfectly annotated.

The location and shape of the eyebrows are known, even when they are occluded, and we know the depth and geometry of every single hair strand of the beard. If done well, there are no issues with the identity of the subjects or with the distribution of the data. Also, we are not limited by our wardrobe or the environments we have access to. In this virtual world, we can do anything we wish. So good!
But of course, here is the ultimate question that decides the fate of this project. And that question is: does this work? What is all this good for? And the crazy thing is that Microsoft's previous AI technique could identify facial landmarks on real people, even though it had never seen a real person before. How cool is that! But this is a previous work, and now a new paper has emerged, in which scientists at Microsoft asked: how about more than 10 times more landmarks? Yes, this new paper promises no fewer than 700. When I saw this, I thought: are you kidding? Are we really going 10x just one more paper down the line? Well, I will believe it when I see it.

Let’s see a different previous technique
from just two years ago. You see that we have temporal consistency issues; in other words, there is plenty of flickering going on here. And there is one more problem: these facial expressions are really giving it a hard time. Can we really expect any improvement over
these two years? Well, hold on to your papers and let’s have
a look at the new method and see for ourselves. Look at that! It not only tracks a ton more landmarks, but
the consistency of the results has improved a ton as well. So, it solves a harder problem, and it does it better than the previous technique. Wow! And all this just one more paper down the
line. My goodness. I love it! I feel like this new method is the first that
could even track Jim Carrey himself. And, we are not done yet! Not even close – it gets even better! I was wondering if it still works in the presence
of occlusions, for instance, whenever the face is covered by hair or clothing, or a
flower. And, let’s see. It still works amazingly well! What about the colors? That is the other really cool thing – it can,
for instance, tell us how confident it is in these predictions.

Green means confident, red means that the AI has to do more guesswork, often because of these occlusions. My other favorite thing is that this is still trained with synthetic data. In fact, that is key to its success. This is one of those success stories where an AI trained in a simulated world can be brought into the real world, and it still works spectacularly. There are a lot of factors at play here, so let's send out a huge thank you to computer graphics researchers as well for making this happen. These virtual characters could not be rendered and animated in real time without decades of incredible graphics research. Thank you!
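As a small aside on that green-to-red coloring: here is a minimal, purely illustrative sketch, not the paper's code, of how per-landmark confidence values could be turned into such a color map for display. The landmark format and confidence scores are placeholders I made up for the example.

```python
# Illustrative only: map a per-landmark confidence in [0, 1] to a green-to-red
# color and draw the landmarks on top of a video frame. The landmark layout and
# confidence values are placeholder assumptions, not Microsoft's actual outputs.
import cv2
import numpy as np


def draw_landmarks_with_confidence(frame: np.ndarray,
                                   landmarks_2d: np.ndarray,  # (N, 2) pixel coordinates
                                   confidence: np.ndarray     # (N,) values in [0, 1]
                                   ) -> np.ndarray:
    out = frame.copy()
    for (x, y), c in zip(landmarks_2d.astype(int), confidence):
        # High confidence -> green, low confidence (e.g. occluded) -> red.
        color = (0, int(255 * c), int(255 * (1.0 - c)))  # BGR order for OpenCV
        cv2.circle(out, (int(x), int(y)), radius=2, color=color, thickness=-1)
    return out
```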
And now comes the ultimate question: how long do we have to wait for these results? This is incredibly important. Why? Well, here is a previous technique that was amazing at tracking our hand movements. Do you see these gloves? Yes? Well, those are not gloves. This is how a previous method understands our hand motions, which is to say that it can reconstruct them nearly perfectly. Stunning work. However, these reconstructions are typically used in virtual worlds, and we had to wait for nearly an hour for such a reconstruction to happen.

Do we have the same situation here? You know, 10x better results in facial landmark detection, so what is the price that we have to pay for this? One hour of waiting again? Well, not at all! If you have been holding on to your papers, now, squeeze that paper, because it is not only real time, it is more than twice as fast as real time. It can churn out 150 frames per second, and it doesn't even require your graphics card; it runs on your processor. That is incredible.
Here is one more comparison against the competitors. For instance, Apple's ARKit runs on their own iPhones, and thus they can make use of the additional depth information. That is a goldmine of information. But this new technique doesn't: it only takes color data, which is so much harder, but in return, it will run on any phone.

Can these results compete with Apple’s solution
with less data? Let’s have a look. My goodness, I love it. The results seem at the very least comparably
good. That is, once again, amazing progress in just
one paper. So cool! Also, what I am really excited about is that
variants of this technique may also be able to improve the fidelity of these DeepFake
videos out there. For instance, here is an example of me becoming
a bunch of characters from Game of Thrones. This previous work was incredible because it could even track where I was looking. Imagine a new generation of these tools that
is able to track even more facial landmarks, and democratize creating movies, games and
all kinds of virtual worlds. Yes, with some of these techniques, we can
even become a painting or a virtual character as well, and even the movement of our nostrils
would be transferred. What a time to be alive! So, does this get your mind going? What would you use this for? Let me know in the comments below! Thanks for watching and for your generous
support, and I'll see you next time!
