Long Context Transfer from Language to Vision
Today's paper presents a method that enables large multimodal models (LMMs) to understand extremely long videos by leveraging the long-context capabilities of the underlying language model. The authors call this approach "long context transfer" from language to vision.