Google’s open-source Gemma is already a small model designed to run on devices like smartphones. However, Google continues to expand the Gemma family and optimize its models for local use on phones and laptops.
Its newest model, EmbeddingGemma, will take on embedding models already used by enterprises, touting a smaller parameter count than most along with strong benchmark performance. EmbeddingGemma is a 308-million-parameter, open-source model optimized to run on devices like laptops, desktops and mobile devices.
EmbeddingGemma, based on the Gemma 3 architecture, was trained on more than 100 languages.
Min Choi, product manager, and Sahil Dua, lead research engineer at Google DeepMind, wrote in a blog post that EmbeddingGemma “offers customizable output dimensions” and will work with its open-source Gemma 3n model. It integrates with tools like Ollama, llama.cpp, MLX, LiteRT, LMStudio, LangChain, LlamaIndex and Cloudflare.
“Designed specifically for on-device AI, its highly efficient 308 million parameter design enables you to build applications using techniques such as Retrieval Augmented Generation (RAG) and semantic search that run directly on your hardware,” Choi and Dua said. “It delivers private, high-quality embeddings that work anywhere, even without an internet connection.”
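As a rough illustration of what “runs directly on your hardware” can look like in practice, here is a minimal sketch using the sentence-transformers library, one of the tools the team lists as supported. The Hugging Face model ID is an assumption and should be checked against the official EmbeddingGemma model card.

```python
# Minimal sketch: generating embeddings locally with sentence-transformers.
# The model ID below is an assumption; verify it against the official
# EmbeddingGemma model card before use.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed model ID

sentences = [
    "EmbeddingGemma is optimized for laptops, desktops and phones.",
    "RAG pipelines rely on an embedding model for retrieval.",
]
embeddings = model.encode(sentences)  # runs fully on-device, no API call
print(embeddings.shape)  # e.g. (2, 768) at the full output dimension
```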
The model performed well on the Massive Text Embedding Benchmark (MTEB) multilingual v2, which measures the capabilities of embedding models. It is the highest-ranked model under 500M parameters.

Mobile RAG
A significant use case for EmbeddingGemma involves developing mobile RAG pipelines and implementing semantic search. RAG relies on embedding models, which create numerical representations of data that models or agents can reference to answer queries.
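To make that concrete, a minimal semantic-search sketch might look like the following; the model ID and the documents are illustrative assumptions, not examples from Google's documentation.

```python
# Hypothetical semantic search: embed a handful of documents and a query,
# then rank the documents by cosine similarity. Model ID is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed model ID

docs = [
    "The VPN client must be updated before the end of the quarter.",
    "Travel expenses are reimbursed within 30 days.",
    "The cafeteria is closed on public holidays.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(["When do I get my travel money back?"],
                         normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity.
scores = doc_vecs @ query_vec
print(docs[int(np.argmax(scores))])  # most relevant document
```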
Most RAG pipelines do not run on laptops or phones, but rather on cloud or on-premises instances. Building a mobile RAG pipeline moves retrieval and answering onto the local device: employees can ask questions or direct agents from their phones or other devices to find the information they need.
Interest in running AI applications natively on devices is growing, with tooling for creating mobile AI applications and more models running on devices, such as Liquid AI’s new LFM2-VL. Companies like Apple, Samsung and Qualcomm are competing to integrate hardware and software capable of running AI models on portable devices without sacrificing battery life.
Choi and Dua said EmbeddingGemma is designed to produce high-quality embeddings. They explained that RAG pipelines have two key stages: retrieving relevant context and generating answers based on that context.
“For this RAG pipeline to be effective, the quality of the initial retrieval step is critical. Poor embeddings will retrieve irrelevant documents, leading to inaccurate or nonsensical answers. This is where EmbeddingGemma's performance shines, providing the high-quality representations needed to power accurate and reliable on-device applications,” Choi and Dua said.
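A bare-bones version of that two-stage pipeline, running entirely on local tooling, might look like the sketch below. The embedding model ID, the Ollama model tag and the documents are all assumptions made for illustration.

```python
# Hypothetical on-device RAG: (1) retrieve the most relevant chunk with an
# embedding model, (2) generate an answer grounded in that chunk via Ollama.
# The model IDs/tags are assumptions, not confirmed identifiers.
import numpy as np
import ollama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("google/embeddinggemma-300m")  # assumed ID

docs = [
    "The Q3 travel policy caps hotel rates at $250 per night.",
    "Expense reports are due within 30 days of travel.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

query = "What is the hotel rate limit?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]

# Stage 1: retrieval. Poor embeddings here would surface the wrong chunk.
context = docs[int(np.argmax(doc_vecs @ query_vec))]

# Stage 2: generation, constrained to the retrieved context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
result = ollama.generate(model="gemma3n", prompt=prompt)  # assumed model tag
print(result["response"])
```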
To do this, EmbeddingGemma uses a method called Matryoshka Representation Learning. This gives the model flexibility, as it can provide multiple embedding sizes within a single model. For example, developers can use the full 768-dimension vector that EmbeddingGemma produces or truncate it to smaller sizes for faster, lighter-weight retrieval.
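In practice, truncating a Matryoshka-style embedding typically amounts to slicing the vector and re-normalizing it, as in this hedged sketch; the model ID is again an assumption.

```python
# Hypothetical Matryoshka-style truncation: take the full 768-dimension
# embedding, keep the leading 256 dimensions, and re-normalize.
# Model ID is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")  # assumed model ID

full = model.encode(["quarterly travel policy"])[0]   # 768 dimensions
small = full[:256]                                    # keep the leading dims
small = small / np.linalg.norm(small)                 # re-normalize

print(full.shape, small.shape)  # (768,) (256,)
```

Recent versions of sentence-transformers also expose a truncate_dim argument that handles this at load time, though developers should confirm support in the version they use.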
Growing embedding model market
The growing use of RAG in enterprises has also led to increased interest in embedding models. In fact, EmbeddingGemma is not the only embedding model from Google, which released its Gemini Embedding model in July.
Cohere is on its fourth iteration of an embedding model with Cohere Embed 4. France’s Mistral has Codestral Embed, OpenAI has Text Embedding 3 Large, and Qodo launched Qodo-Embed-1-1.5B.
The promise of bringing RAG and embedding models onto devices has excited some builders.
