GELATO: The Frozen Towers Approach to Multimodal Embeddings
GELATO investigates extending a strong pre-trained text embedding model to multimodal data instead of training a new model from scratch. The text encoder (the 'text tower') stays frozen, while separate modality-specific encoders are trained to map images, audio, or other modalities into the same embedding space. This 'frozen towers' strategy reuses the model's existing text understanding and avoids retraining the core model. The blog post outlines the method and the motivation behind it: efficient multimodal representation learning.
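
To make the frozen-tower idea concrete, here is a minimal toy sketch in plain Python. It is not GELATO's implementation. The linear "towers", the dimensions, the random paired data, and the mean-squared-error alignment objective are all assumptions chosen for illustration, and the post summarized here does not say which training objective is used. The sketch shows only the core mechanic: the text tower's weights are never updated, and only the image tower is trained to land on the text embeddings.

```python
import random

random.seed(0)
TEXT_DIM, IMAGE_DIM, EMBED_DIM = 5, 6, 3  # hypothetical toy sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


# Frozen text tower: a fixed linear map standing in for the pre-trained
# text encoder (assumption). Nothing below ever writes to TEXT_WEIGHTS.
TEXT_WEIGHTS = rand_matrix(EMBED_DIM, TEXT_DIM)
TEXT_WEIGHTS_SNAPSHOT = [row[:] for row in TEXT_WEIGHTS]

# Trainable image tower: the only parameters that receive updates.
image_weights = rand_matrix(EMBED_DIM, IMAGE_DIM)


def text_tower(text_features):
    return matvec(TEXT_WEIGHTS, text_features)


def image_tower(image_features):
    return matvec(image_weights, image_features)


# Toy paired data: (text features, image features) for the same item.
pairs = [
    (
        [random.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)],
        [random.uniform(-1.0, 1.0) for _ in range(IMAGE_DIM)],
    )
    for _ in range(4)
]


def alignment_loss():
    """Mean squared distance between image and frozen text embeddings."""
    total = 0.0
    for text_features, image_features in pairs:
        target = text_tower(text_features)
        pred = image_tower(image_features)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)


def train_step(lr=0.05):
    """One full-batch gradient step on the image tower only."""
    grads = [[0.0] * IMAGE_DIM for _ in range(EMBED_DIM)]
    for text_features, image_features in pairs:
        target = text_tower(text_features)  # treated as a constant target
        pred = image_tower(image_features)
        for i in range(EMBED_DIM):
            err = 2.0 * (pred[i] - target[i]) / len(pairs)
            for j in range(IMAGE_DIM):
                grads[i][j] += err * image_features[j]
    for i in range(EMBED_DIM):
        for j in range(IMAGE_DIM):
            image_weights[i][j] -= lr * grads[i][j]


initial_loss = alignment_loss()
for _ in range(200):
    train_step()
final_loss = alignment_loss()

print(f"alignment loss: {initial_loss:.4f} -> {final_loss:.4f}")
print("text tower unchanged:", TEXT_WEIGHTS == TEXT_WEIGHTS_SNAPSHOT)
```

In a realistic setup the towers would be deep networks and the objective would more likely be contrastive, but the division of labor would be the same: gradients flow only into the new modality encoder, so text embeddings computed before training remain valid afterwards.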