CM3leon is the first multimodal model trained with a recipe adapted from text-only language models: a large-scale retrieval-augmented pre-training stage followed by a multitask supervised fine-tuning (SFT) stage. The approach is simple yet yields a strong model, showing that tokenizer-based transformers can be trained as efficiently as existing diffusion-based generative models. Despite using five times less compute than previous transformer-based methods, CM3leon achieves state-of-the-art performance in text-to-image generation.
The model offers the versatility and effectiveness of autoregressive models while keeping training costs low and inference efficient. As a causal masked mixed-modal (CM3) model, CM3leon can generate sequences of text and images conditioned on arbitrary sequences of other image and text content, going well beyond previous models that handled only text-to-image or only image-to-text generation.
According to the company, CM3leon's capabilities allow its image generation tools to follow prompts more closely and produce more coherent images.
On the most widely used image generation benchmark (zero-shot MS-COCO), CM3leon achieves an FID (Fréchet Inception Distance) score of 4.88, establishing a new state of the art in text-to-image generation and outperforming Google's text-to-image model, Parti.
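For readers unfamiliar with the metric, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images; lower scores mean the generated images are statistically closer to real ones. The announcement does not spell out the formula, so the notation below is the standard definition with symbols chosen here for illustration: \mu_r, \Sigma_r are the feature mean and covariance for real images, and \mu_g, \Sigma_g are those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero only when the two fitted Gaussians coincide, which is why a lower value such as 4.88 indicates better image quality.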
Additionally, Meta said CM3leon excels at a variety of vision-language tasks, such as visual question answering and long-form captioning.