Apple Researchers Develop MM1: Multimodal AI Model Family with 30 Billion Parameters

Pre-Training Phase Underway for Apple’s MM1 AI Models: Researchers

Apple researchers recently published a pre-print paper detailing their efforts to develop a multimodal artificial intelligence (AI) large language model (LLM). Released on March 14, the paper outlines their success in imbuing the model with advanced multimodal capabilities, enabling it to process both text and image data. The advancement aligns with Apple CEO Tim Cook’s remarks during the company’s earnings calls hinting that AI features could arrive later this year.

The research paper, available on arXiv, an open-access repository for scholarly papers that have not been peer-reviewed, provides insights into the development of MM1, a family of multimodal models with up to 30 billion parameters. Although the paper does not explicitly mention Apple, most of the researchers credited are affiliated with the company’s machine learning (ML) division, suggesting a strong connection to the tech giant.

Describing MM1 as a “performant multimodal LLM (MLLM),” the authors emphasize that the image encoders, vision language connector, other architectural components, and pre-training data were carefully chosen to produce a model capable of processing both textual and image-based inputs. According to the paper, it is these design and data choices that give the model its ability to understand and interpret both kinds of data.
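The paper does not ship source code, so the following is only a minimal, illustrative sketch of how such components typically fit together in a multimodal LLM: an image encoder turns an image into patch features, a vision-language connector projects those features into the language model’s embedding space, and the language model then processes the resulting image tokens alongside text tokens. All names and dimensions below (ToyImageEncoder, VisionLanguageConnector, ToyMLLM, and so on) are hypothetical stand-ins, not components taken from the MM1 paper.

# Illustrative sketch only; not Apple's MM1 implementation. All class names,
# dimensions, and hyperparameters here are hypothetical.
import torch
import torch.nn as nn


class ToyImageEncoder(nn.Module):
    """Maps an image to a sequence of patch features (stand-in for a vision transformer)."""
    def __init__(self, patch_size=16, embed_dim=256):
        super().__init__()
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.patchify(images)                # (B, D, H/P, W/P)
        return feats.flatten(2).transpose(1, 2)      # (B, num_patches, D)


class VisionLanguageConnector(nn.Module):
    """Projects image features into the language model's token embedding space."""
    def __init__(self, vision_dim=256, llm_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, image_feats):
        return self.proj(image_feats)                # (B, num_patches, llm_dim)


class ToyMLLM(nn.Module):
    """Language model that consumes image tokens and text tokens in one sequence."""
    def __init__(self, vocab_size=32000, llm_dim=512, n_layers=2, n_heads=8):
        super().__init__()
        self.image_encoder = ToyImageEncoder(embed_dim=256)
        self.connector = VisionLanguageConnector(256, llm_dim)
        self.token_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, images, text_ids):
        img_tokens = self.connector(self.image_encoder(images))   # image "soft tokens"
        txt_tokens = self.token_embed(text_ids)                   # text token embeddings
        seq = torch.cat([img_tokens, txt_tokens], dim=1)          # image tokens first, then text
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.lm_head(self.backbone(seq, mask=causal_mask)) # next-token logits


model = ToyMLLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 212, 32000]): 196 image positions + 16 text positions

In a real system the image encoder would be a large pretrained vision model and the backbone a pretrained decoder-only LLM; the sketch only shows how a connector lets the two share a single token sequence.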

While the breakthrough is significant, this research paper alone is not enough to ascertain that a multimodal AI chatbot will be added to Apple’s operating systems. At this stage, it is also unclear whether the model is multimodal only in the inputs it accepts or in its outputs as well, that is, whether it can generate AI images. But if the results are confirmed to be consistent after peer review, it can be said that the tech giant has taken another big step towards building a native generative AI foundation model.