"MAGVIT: Masked Generative Video Transformer"

"MAGVIT: Masked Generative Video Transformer"
Image from https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf

#ComputerVision #GenerativeVideo #Transformers #MaskGIT

Introduction



With the advances in image and video generation, researchers have proposed a new video generation model based on vision transformers.

MAGVIT became the new state of the art in generative video upon its release on 10 December 2022.

It is faster than previous models and can handle multiple tasks, such as Frame Prediction or Frame Interpolation, with one single model.



Transformers are models originally developed for sequence transduction. They proved very effective at tasks such as speech recognition and text-to-speech. After demonstrating their performance in those domains, researchers applied them to images; these models are called Vision Transformers.

MAGVIT overview

The MAGVIT model is a multitask model that can perform Frame Prediction, Frame Interpolation, Outpainting, Inpainting, Class-conditional Generation and several other tasks.

It takes one or several frames as input (depending on the task) and generates a video.

Let's define some vocabulary:

  • Encoder: a neural network that transforms any type of sequence into a sequence of numbers. For example, it will turn this sentence or an image into numbers.
  • Decoder: a neural network that transforms a sequence of numbers back into any type of sequence (see the toy sketch below).
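
For intuition, here is a tiny Python sketch of this encode/decode idea. It is a lookup-table toy for illustration only, not the paper's neural networks; the vocabulary and the encode/decode functions are made up:

```python
# Toy illustration of the encoder/decoder idea: map a sequence to numbers and back.
vocab = {"a": 0, "masked": 1, "video": 2, "transformer": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(words):
    """Sequence -> numbers (a real encoder is a neural network)."""
    return [vocab[w] for w in words]

def decode(ids):
    """Numbers -> sequence (a real decoder is a neural network)."""
    return [inv_vocab[i] for i in ids]

print(encode(["a", "masked", "video", "transformer"]))  # [0, 1, 2, 3]
print(decode([2, 3]))                                   # ['video', 'transformer']
```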

Architecture

MAGVIT is built from two parts: a 3D-VQ tokenizer, which quantizes a video into a grid of discrete tokens, and a non-autoregressive transformer trained with masked token modeling (MTM), which generates those tokens so the tokenizer's decoder can turn them back into a video.
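
To make the second stage concrete, here is a minimal sketch under my own simplifications (the class and function names, the sizes, and the MASK_ID constant are assumptions for illustration, not the authors' code): the transformer predicts all masked tokens in parallel and, MaskGIT-style, commits the most confident ones over a few refinement steps before the 3D-VQ decoder turns the final tokens back into pixels.

```python
import torch
import torch.nn as nn

MASK_ID = 1024                                   # hypothetical id reserved for the [MASK] token

class MaskedTransformer(nn.Module):
    """Bidirectional transformer over the VQ token grid (flattened to a sequence)."""
    def __init__(self, vocab=1025, dim=256, seq_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                   # tokens: (batch, seq_len) ids
        x = self.embed(tokens) + self.pos
        return self.head(self.blocks(x))         # logits over the codebook

@torch.no_grad()
def generate(model, seq_len=256, steps=8):
    """Start from all [MASK] tokens and commit the most confident predictions
    over a few refinement steps (MaskGIT-style non-autoregressive decoding)."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)
        pred = logits.argmax(dim=-1)                          # best token per position
        conf = logits.softmax(dim=-1).max(dim=-1).values      # its confidence
        keep = seq_len * (step + 1) // steps                  # tokens committed so far
        idx = conf.topk(keep, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens                                # token ids to feed the 3D-VQ decoder
```

Because each step predicts many tokens at once, decoding takes a handful of passes instead of one pass per token, which is where the speed advantage over autoregressive transformers comes from.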

The COMMIT Method

The MTM approach worked well but was not flexible enough and lacked generalization across tasks. The authors therefore introduced a new method: Conditional Masked Modeling by Interior Tokens (COMMIT). It first embeds the condition inputs with the VQ tokenizer, then builds the input sequence by turning each token into a condition token, a mask token, or the original token, depending on a conditional masking function.

This lets the model generalize better across tasks.
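
Here is a rough sketch of that corruption step, based on my reading of the description above (the function name commit_corrupt, MASK_ID, the mask ratio, and the example region layout are illustrative assumptions, not the authors' code):

```python
import torch

MASK_ID = 1024  # hypothetical id reserved for the [MASK] token

def commit_corrupt(target_tokens, cond_tokens, cond_region, mask_ratio=0.5):
    """Build the conditional input used to train the transformer.

    target_tokens : (seq,) tokens of the ground-truth video
    cond_tokens   : (seq,) tokens of the padded condition video
    cond_region   : (seq,) bool, True where the task condition is defined
    """
    corrupted = target_tokens.clone()
    masked = torch.rand(target_tokens.shape) < mask_ratio    # positions to corrupt
    # Inside the condition region, corrupted positions take the condition token;
    # outside it, they take the [MASK] token. The remaining positions keep the target.
    corrupted[masked & cond_region] = cond_tokens[masked & cond_region]
    corrupted[masked & ~cond_region] = MASK_ID
    return corrupted, masked

# Example: a frame-prediction-like setup where only the first tokens are conditioned.
seq = 16
target = torch.randint(0, 1024, (seq,))
cond = torch.randint(0, 1024, (seq,))
region = torch.zeros(seq, dtype=torch.bool)
region[:4] = True                        # the condition covers the first few tokens
corrupted_input, mask = commit_corrupt(target, cond, region)
```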

Image from https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf

Results and Conclusion

Image from https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf

As the results above show, MAGVIT performs well on frame prediction. It is the best model across multiple tasks, reaching state-of-the-art results on all the tasks discussed here.

In conclusion, MAGVIT is: