4M-21: An Any-to-Any Vision Model for Tens of…

Jun 18

Today's paper presents 4M-21, an any-to-any vision model that handles tens of diverse tasks and modalities. It significantly expands the capabilities of existing multimodal models such as 4M.

Method Overview

The method builds on the multimodal masked modeling approach of 4M: each modality is first converted into a sequence of discrete tokens by a modality-specific tokenizer. Because every modality shares this discrete token representation, the model can accept inputs in any modality and can generate or predict all modalities.
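To make the tokenize-then-mask idea concrete, here is a minimal Python sketch. It is not the 4M-21 implementation: the two toy tokenizers (`tokenize_caption`, `tokenize_grid`) and the helper `sample_masked_example` are hypothetical stand-ins, and the random input/target split is an assumption about how masked modeling over several modalities could be set up. The actual model relies on learned, modality-specific tokenizers.

```python
import random

def tokenize_caption(text, vocab_size=1000):
    # Hypothetical stand-in for a text tokenizer:
    # one deterministic integer id per whitespace-separated word.
    return [sum(ord(c) for c in word) % vocab_size for word in text.split()]

def tokenize_grid(values, levels=16):
    # Hypothetical stand-in for an image-like tokenizer:
    # quantize floats in [0, 1] into `levels` discrete bins.
    return [min(int(v * levels), levels - 1) for v in values]

def sample_masked_example(tokens_by_modality, num_input, num_target, rng):
    # Flatten every modality into (modality, position, token) triples so that
    # all modalities live in one shared pool of discrete tokens.
    pool = [
        (modality, position, token)
        for modality, tokens in tokens_by_modality.items()
        for position, token in enumerate(tokens)
    ]
    rng.shuffle(pool)
    # A random subset is visible to the model; a disjoint subset is to be predicted.
    inputs = pool[:num_input]
    targets = pool[num_input:num_input + num_target]
    return inputs, targets

# Toy sample with three "modalities" (names and values are illustrative only).
sample = {
    "caption": tokenize_caption("a dog on grass"),
    "rgb": tokenize_grid([0.1, 0.5, 0.9, 0.3]),
    "depth": tokenize_grid([0.0, 0.25, 0.75, 1.0]),
}
inputs, targets = sample_masked_example(sample, num_input=5, num_target=4, rng=random.Random(0))
```

Because inputs and targets are drawn from the same mixed pool, any modality can end up on either side of the split, which is the property that any-to-any prediction depends on.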


AI Paper of the Day
