Tech Giants Showcase Innovation: Google Launches Gemini 1.5, Meta Debuts Predictive Visual Machine Learning Model V-JEPA
Google Unveils Limited Version of Gemini 1.5 with Extended Context Window
In a flurry of artificial intelligence (AI) advancements, Google and Meta took center stage on Thursday, unveiling groundbreaking models set to redefine AI capabilities. Google’s announcement introduced Gemini 1.5, an updated AI model boasting enhanced long-context comprehension across various modalities.
Meanwhile, Meta made waves with the introduction of its Video Joint Embedding Predictive Architecture (V-JEPA) model, heralded as a pioneering non-generative method for advanced machine learning (ML) through visual media. These developments mark a significant leap forward in exploring the potential of AI technology. Notably, OpenAI also made strides in the AI landscape with the introduction of its inaugural text-to-video generation model, Sora.
The Gemini 1.5 model was unveiled in a detailed blog post by Demis Hassabis, CEO of Google DeepMind. The next-generation model is built on a Transformer and Mixture of Experts (MoE) architecture.
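In an MoE design, each token is routed to a small subset of specialised "expert" sub-networks rather than through one monolithic feed-forward block. The sketch below is a minimal PyTorch toy with four experts and top-1 routing; the expert count, sizes and routing scheme are assumptions for illustration and do not reflect Gemini's actual configuration.

```python
# Minimal sketch of a Mixture of Experts (MoE) feed-forward layer with top-1 routing.
# Illustrative toy only -- expert count, sizes, and routing here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int = 64, d_hidden: int = 256, num_experts: int = 4):
        super().__init__()
        # Each "expert" is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = F.softmax(self.router(x), dim=-1)   # routing probabilities per token
        top_prob, top_idx = scores.max(dim=-1)       # pick the single best expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                      # tokens routed to expert i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Only the selected expert runs for each token, which is why MoE models can grow
# total parameter count without a proportional rise in per-token compute.
layer = ToyMoELayer()
tokens = torch.randn(2, 10, 64)
print(layer(tokens).shape)  # torch.Size([2, 10, 64])
```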
While further variants are expected, the current focus is the Gemini 1.5 Pro model, now available for early testing. Hassabis said this mid-size multimodal model performs at a level comparable to Gemini 1.0 Ultra, the company’s largest generative model to date. The Gemini 1.0 Ultra model is currently accessible through the Gemini Advanced subscription, bundled with Google One’s AI Premium plan.
The biggest improvement in Gemini 1.5 is its ability to process long-context information. The standard Pro version comes with a 128,000-token context window, up from 32,000 tokens in Gemini 1.0. Tokens can be understood as whole or partial pieces of words, images, videos, audio or code, which serve as the building blocks a foundation model uses to process information. “The bigger a model’s context window, the more information it can take in and process in a given prompt — making its output more consistent, relevant and useful,” Hassabis explained.
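As a rough illustration of how that budget works in practice, the toy Python snippet below approximates token counts from word counts (real tokenizers split text into subword pieces, so the numbers are only indicative) and checks whether a prompt plus an attached document fits inside a 128,000-token window.

```python
# Toy illustration of how a context window bounds how much a model can see per prompt.
# The word-based count is a crude stand-in for a real subword tokenizer.
CONTEXT_WINDOW = 128_000  # Gemini 1.5 Pro's standard window, per Google's announcement

def approx_token_count(text: str) -> int:
    # Very rough heuristic: roughly 4/3 tokens per whitespace-separated word.
    return int(len(text.split()) * 4 / 3)

def fits_in_context(prompt: str, document: str) -> bool:
    # The prompt and the attached document must fit in the window together.
    return approx_token_count(prompt) + approx_token_count(document) <= CONTEXT_WINDOW

document = "word " * 90_000  # roughly 120,000 estimated tokens
print(fits_in_context("Summarise this report.", document))      # True: within 128,000
print(fits_in_context("Summarise this report.", document * 2))  # False: exceeds the window
```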
Alongside the standard Pro version, Google is also releasing a special model with a context window of up to 1 million tokens. It is being offered to a limited group of developers and enterprise clients in a private preview. While there is no dedicated consumer platform for it, it can be tried via Google AI Studio, a cloud console tool for testing generative AI models, and Vertex AI. Google says this version can process one hour of video, 11 hours of audio, codebases with over 30,000 lines of code, or over 700,000 words in one go.
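For developers with access, a call through the google-generativeai Python client might look roughly like the sketch below. The model identifier, API key placeholder and file name are assumptions, and access to the 1-million-token window depends on Google's private preview.

```python
# Hedged sketch of trying Gemini 1.5 Pro through the API surfaced by Google AI Studio,
# using the google-generativeai Python client. Model name, key, and file are assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key obtained from Google AI Studio (assumed)

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed preview model identifier

with open("long_report.txt", "r", encoding="utf-8") as f:  # hypothetical long document
    long_document = f.read()

# count_tokens shows how much of the context window the document consumes.
print(model.count_tokens(long_document).total_tokens)

response = model.generate_content(
    "Summarise the key findings in the following document:\n\n" + long_document
)
print(response.text)
```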
Meta, for its part, emphasised V-JEPA’s ability to recognise fine-grained actions from video. “For example, if the model needs to be able to distinguish between someone putting down a pen, picking up a pen, and pretending to put down a pen but not actually doing it, V-JEPA is quite good compared to previous methods for that high-grade action recognition task,” Meta said in a blog post.

At present, the V-JEPA model only uses visual data, which means the videos do not contain any audio input. Meta is now planning to incorporate audio alongside video in the ML model. Another goal for the company is to improve the model’s performance on longer videos.
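The core idea behind a joint-embedding predictive architecture is to predict the representations of hidden video patches rather than regenerate their pixels. The PyTorch sketch below is a schematic toy under that framing; the simple MLP encoders, dimensions and masking are assumptions for illustration and do not mirror Meta's actual vision-transformer implementation.

```python
# Schematic sketch of a joint-embedding predictive objective: match the *representations*
# of masked video patches produced by a target encoder, instead of reconstructing pixels.
# Toy dimensions and MLP encoders are assumptions, not Meta's V-JEPA implementation.
import torch
import torch.nn as nn

D_PATCH, D_REPR, N_PATCHES = 768, 256, 32  # toy dimensions (assumed)

context_encoder = nn.Sequential(nn.Linear(D_PATCH, D_REPR), nn.GELU(), nn.Linear(D_REPR, D_REPR))
target_encoder = nn.Sequential(nn.Linear(D_PATCH, D_REPR), nn.GELU(), nn.Linear(D_REPR, D_REPR))
predictor = nn.Sequential(nn.Linear(D_REPR, D_REPR), nn.GELU(), nn.Linear(D_REPR, D_REPR))

video_patches = torch.randn(4, N_PATCHES, D_PATCH)  # a batch of flattened video patches
mask = torch.rand(4, N_PATCHES) < 0.5               # which patches are hidden from the context

with torch.no_grad():
    # Target representations come from the full, unmasked video; no gradients flow here
    # (in practice the target encoder is typically a moving average of the context encoder).
    targets = target_encoder(video_patches)

visible = video_patches * (~mask).unsqueeze(-1)      # zero out the masked patches
predicted = predictor(context_encoder(visible))      # predict a representation for every patch

# The loss compares predictions to target representations only at the masked positions,
# so the model never has to reconstruct raw pixels.
loss = nn.functional.l1_loss(predicted[mask], targets[mask])
loss.backward()
print(float(loss))
```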