4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
vladbogo.substack.com
Today's paper presents 4M-21, an any-to-any vision model that can handle tens of diverse tasks and modalities. It significantly expands the capabilities of existing multimodal models such as 4M.

Method Overview

The method builds on the multimodal masked modeling approach of 4M: each modality is first tokenized into a sequence of discrete tokens by a modality-specific tokenizer. Because every modality ends up in the same token format, the model can take inputs in any modality and can generate or predict all modalities.
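To make the tokenize-then-mask idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the two toy tokenizers, the function names (`tokenize_caption`, `tokenize_image`, `sample_masked_example`), the vocabulary sizes, and the input/target budgets are all hypothetical stand-ins chosen for illustration. It only shows the general pattern of turning each modality into discrete tokens and then sampling one random subset of tokens as visible input and a disjoint subset as prediction targets.

```python
import random

def tokenize_caption(text, vocab_size=1000):
    # Hypothetical stand-in for a text tokenizer:
    # map each word to a deterministic integer id.
    return [sum(ord(c) for c in word) % vocab_size for word in text.split()]

def tokenize_image(pixels, codebook_size=256):
    # Hypothetical stand-in for a learned image tokenizer:
    # quantize each value in [0, 1] to a codebook index.
    return [min(int(p * codebook_size), codebook_size - 1) for p in pixels]

def sample_masked_example(tokens_by_modality, num_inputs, num_targets, rng):
    # Flatten every modality into (modality, position, token) triples so that
    # all modalities share one token pool.
    pool = [
        (modality, pos, tok)
        for modality, seq in tokens_by_modality.items()
        for pos, tok in enumerate(seq)
    ]
    if num_inputs + num_targets > len(pool):
        raise ValueError("token budgets exceed the number of available tokens")
    # Draw without replacement, so inputs and targets never overlap.
    chosen = rng.sample(pool, num_inputs + num_targets)
    return chosen[:num_inputs], chosen[num_inputs:]

rng = random.Random(0)  # fixed seed for reproducibility
tokens = {
    "caption": tokenize_caption("a dog on a beach"),
    "rgb": tokenize_image([0.0, 0.25, 0.5, 0.99]),
}
visible, targets = sample_masked_example(tokens, num_inputs=3, num_targets=2, rng=rng)
print("visible:", visible)
print("targets:", targets)
```

Because inputs and targets are drawn from a single pool that spans all modalities, any modality can land on either side of the split, which is the property that lets one model be trained to map from any modality to any other.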