Gemma 4 Is Out: Run a Full Agentic AI Stack on Your Laptop for Free (Apache 2.0, Ollama, Function-Calling)

Gemma 4 is Google's most capable open-weight model — Apache 2.0, native function calling, Extended Thinking, and edge deployment down to Raspberry Pi. The complete builder's guide to running agentic AI locally in 2026.

AI Builder Club · April 6, 2026 · 9 min read

Google released Gemma 4 on April 2, 2026 — and it is the most consequential open-weight model drop of the year. Not because of benchmark scores (though those are remarkable), but because of what Gemma 4 makes possible for builders: a fully agentic, commercially-usable AI stack that runs entirely on your local hardware with zero cloud dependency.

This is the complete guide for AI builders who want to understand what Gemma 4 actually does differently, and how to put it to work in production agent systems.

What Is Gemma 4?

Gemma 4 is Google DeepMind's fourth-generation open-weight model family, built on the same research architecture as Gemini 3. Four model variants cover the entire deployment spectrum, from an edge model with 2.3B effective parameters that runs in under 1.5GB RAM on a smartphone to a 31B dense flagship that ranks #3 globally among all open models.

Every variant ships under the Apache 2.0 license, a first for the Gemma series. No monthly active user caps. No commercial restrictions. Full redistribution rights. This alone makes Gemma 4 the default answer for builders who need to ship commercial products without licensing headaches.

The Four Gemma 4 Models

Gemma 4 E2B (Edge 2B): 2.3B effective parameters, 5.1B total. Runs in under 1.5GB RAM using 2-bit/4-bit quantization. 128K context. Native text, image, audio, and video input. Reaches 133 prefill tokens/sec on a Raspberry Pi 5 CPU and 3,700 tokens/sec on a Qualcomm Dragonwing NPU. This is the model for offline-first mobile and IoT agents.

Gemma 4 E4B (Edge 4B): 4.5B effective parameters. Designed for laptops and mobile workstations. 128K context. 3x faster Android inference than the prior Gemma generation, with 60% less battery drain. Includes audio input.

Gemma 4 26B MoE: Only 3.8B parameters active per token (25.2B total). 256K context. Codeforces ELO 1,718. AIME 2026 score 88.3%. Arguably the most parameter-efficient model at this scale: you get near-flagship reasoning while activating roughly one-seventh of the parameters per token. Best choice for developers with a modern gaming GPU.

Gemma 4 31B Dense (Flagship): 30.7B parameters, all active. 256K context. LMArena score 1,452, which places it #3 among all open models and #27 globally including closed APIs like GPT-5.4 and Claude. AIME 2026: 89.2%. Codeforces ELO: 2,150 (top 1% human coder territory). If you have an H100, or an RTX 4090 and are willing to quantize, this is your target.
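As a rough sanity check on hardware sizing, the weights-only footprint of a variant is its parameter count times bits per weight. The sketch below uses the total parameter counts quoted above; the function name is illustrative, and the estimate ignores KV cache, activations, and runtime overhead, so real usage will be higher. Note that a MoE model normally has to hold all of its experts in memory, so the 26B MoE is sized by its 25.2B total, not its 3.8B active parameters.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Total parameter counts (in billions) as quoted in this article.
MODELS = {
    "E2B": 5.1,
    "26B MoE": 25.2,
    "31B Dense": 30.7,
}

for name, params in MODELS.items():
    for bits in (2, 4, 8, 16):
        print(f"{name:10s} {bits:2d}-bit: {weight_memory_gb(params, bits):6.2f} GB")
```

Treat the result as a floor: if the number is already at the limit of your VRAM, the model will not fit once context is added.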

What Makes Gemma 4 Actually Agentic

Prior open-weight models required significant prompt engineering gymnastics to call tools reliably. Gemma 4 is the first open-weight model family where agentic capability is a first-class design goal — not an afterthought.

1. Native function calling (no grammar hacks)

Gemma 4 outputs structured JSON tool-use calls without grammar constraints or special prompt engineering. The model was trained to understand tool schemas and produce valid, parseable calls consistently. This is what separates it from earlier models where function calling was essentially a prompt-engineering trick.
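To make "valid, parseable calls" concrete, here is a minimal sketch of checking a model-emitted tool call against its declared schema before executing it. The schema uses the JSON-Schema style common to tool-calling APIs; the get_weather tool, the tool_call_errors helper, and the {"name", "arguments"} call shape are assumptions for illustration, and the exact wire format depends on your runtime.

```python
import json

# Hypothetical tool schema in the JSON-Schema style used by most tool-calling APIs.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
        "required": ["city"],
    },
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def tool_call_errors(raw: str, schema: dict) -> list[str]:
    """Return a list of problems with a JSON tool call; an empty list means valid."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    errors = []
    if call.get("name") != schema["name"]:
        errors.append(f"unexpected tool name: {call.get('name')!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return errors + ["arguments must be a JSON object"]

    params = schema["parameters"]
    for key in params.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument: {key}")
            continue
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"argument {key!r} should be of type {spec['type']}")
    return errors
```

Even with a model trained for tool use, a check like this is cheap insurance before a call touches a real API, and the error strings can be fed back to the model as a tool result.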

2. Extended Thinking mode

Enable chain-of-thought reasoning with enable_thinking=True. The model plans multi-step approaches before responding — useful for complex agent tasks where rushing to a tool call produces wrong results. You get the reasoning trace in the output, which makes debugging agent failures far easier.

3. Bounding box generation

Gemma 4 can output bounding box coordinates for elements in images. This is the key capability for computer-use agents: the model can look at a screenshot of a web page or desktop UI and identify exactly where to click, without any external vision model. It works out of the box, with no fine-tuning needed.
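A bounding box only becomes useful to a computer-use agent once it is mapped to a pixel to click. The sketch below assumes boxes arrive as [y_min, x_min, y_max, x_max] normalized to a 0-1000 grid, which is the convention the Gemini API documents for its models. This article does not specify Gemma 4's exact output format, so check the model card and adjust the ordering and scale if they differ. The function name is illustrative.

```python
def box_to_click_point(box, image_width, image_height, scale=1000):
    """Map a normalized [y_min, x_min, y_max, x_max] box to the pixel at its center."""
    y_min, x_min, y_max, x_max = box
    if not (0 <= x_min <= x_max <= scale and 0 <= y_min <= y_max <= scale):
        raise ValueError(f"box out of range or inverted: {box}")
    # Center of the box in normalized units, rescaled to pixel coordinates.
    px = int((x_min + x_max) * image_width / (2 * scale))
    py = int((y_min + y_max) * image_height / (2 * scale))
    # Clamp so a box touching the right or bottom edge still yields a valid pixel.
    return min(px, image_width - 1), min(py, image_height - 1)


# A box covering the region x 200-600, y 100-300 on a 1920x1080 screenshot.
print(box_to_click_point([100, 200, 300, 600], 1920, 1080))  # (768, 216)
```

Clicking the center rather than a corner is a deliberate choice: it tolerates boxes that are slightly too large or too small.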

4. Multimodal function calling

The model can see an image, reason about it, and call an API — all in one inference pass. A Gemma 4 agent can look at a screenshot of a dashboard, identify a chart, extract the numbers, and call your data API to update a record. This is new.

5. Constrained decoding for reliable pipelines

LiteRT-LM (Google's on-device inference runtime) adds constrained decoding, which restricts generation to structured, predictable output every time. Your production agent pipelines get reliable structured responses without a parse, validate, and retry loop.
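For comparison, this is roughly what the fallback looks like when constrained decoding is not available: every generation is wrapped in a parse, validate, retry loop. It is a minimal sketch, not a LiteRT-LM API. The generate callable stands in for whatever inference call you use, and the required-keys check is deliberately simple.

```python
import json
from typing import Callable


def generate_json(generate: Callable[[str], str], prompt: str,
                  required_keys: set, max_attempts: int = 3) -> dict:
    """Call `generate` until it returns a JSON object containing `required_keys`."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = generate(current_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON ({exc.msg})"
        else:
            if isinstance(data, dict) and required_keys.issubset(data):
                return data
            last_error = "JSON parsed but required keys are missing"
        # Feed the failure back so the next attempt can correct itself.
        current_prompt = (f"{prompt}\n\nYour previous output was rejected: "
                          f"{last_error}. Return only valid JSON.")
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")


# Stub "model" that emits truncated JSON once, then a valid object.
_outputs = iter(['{"ticker": "AAPL"', '{"ticker": "AAPL", "price": 182.45}'])
result = generate_json(lambda _prompt: next(_outputs),
                       "Quote AAPL as JSON.", {"ticker", "price"})
print(result)
```

Each retry costs a full generation, which is why moving the guarantee into the decoder matters for latency-sensitive on-device pipelines.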

Quick Start: Gemma 4 Agent with Ollama

If you have Ollama installed, pull the 26B MoE (best balance of quality and speed on a laptop GPU):

ollama pull gemma4:26b

For the flagship with more VRAM:

ollama pull gemma4:31b

For edge/laptop (CPU-only):

ollama pull gemma4:e4b

Building a ReAct Agent with Gemma 4 and LangChain

The ReAct (Reason + Act) pattern is the standard framework for agentic AI. Gemma 4 executes it reliably without prompt hacks. Here is a working multi-tool agent using LangChain and Ollama:

from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from langchain.agents import create_react_agent, AgentExecutor
from langchain import hub

# Load Gemma 4 via Ollama
llm = ChatOllama(model="gemma4:26b", temperature=0)

# Define tools
@tool
def search_web(query: str) -> str:
    """Search the web for current information about a topic."""
    # Replace with your actual search implementation
    return f"Search results for: {query}"

@tool
def read_file(path: str) -> str:
    """Read the contents of a local file."""
    try:
        with open(path, 'r') as f:
            return f.read()
    except Exception as e:
        return f"Error reading file: {e}"

@tool
def write_file(path: str, content: str) -> str:
    """Write content to a local file."""
    with open(path, 'w') as f:
        f.write(content)
    return f"Written to {path}"

tools = [search_web, read_file, write_file]

# Pull the standard ReAct prompt
prompt = hub.pull("hwchase17/react")

# Create agent
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=10)

# Run
result = executor.invoke({
    "input": "Research the latest developments in multi-agent AI systems and write a summary to agents_summary.txt"
})

print(result["output"])


The verbose=True output shows the agent's full trace: each Thought, Action, and Observation step. Note that this trace comes from the ReAct prompt format, not from the model's Extended Thinking mode, which is enabled separately.

Native Function Calling with the Ollama Python SDK

For simpler tool-use without LangChain overhead:

```python
import ollama
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current price of a stock by ticker symbol",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker symbol, e.g. AAPL"
                    }
                },
                "required": ["ticker"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "send_slack_message",
            "description": "Send a message to a Slack channel",
            "parameters": {
                "type": "object",
                "properties": {
                    "channel": {"type": "string"},
                    "message": {"type": "string"}
                },
                "required": ["channel", "message"]
            }
        }
    }
]

def handle_tool_call(tool_name: str, args: dict) -> str:
    # Wire up your actual implementations here
    if tool_name == "get_stock_price":
        return json.dumps({"ticker": args["ticker"], "price": 182.45, "change": "+1.2%"})
    if tool_name == "send_slack_message":
        return json.dumps({"status": "sent", "channel": args["channel"]})
    return json.dumps({"error": "unknown tool"})

messages = [
    {"role": "user", "content": "Check the price of AAPL and send it to the #stocks channel"}
]

# Agentic loop
while True:
    response = ollama.chat(
        model="gemma4:26b",
        messages=messages,
        tools=tools
    )

    if response.message.tool_calls:
        messages.append(response.message)
        for call in response.message.tool_calls:
            result = handle_tool_call(call.function.name, call.function.arguments)
            messages.append({
                "role": "tool",
                "content": result
            })
    else:
        print(response.message.content)
        break
```

Gemma 4 will call get_stock_price, receive the result, then call send_slack_message, chaining the two tool calls through the loop without any task-specific orchestration logic.
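One caveat: the while True loop above never exits if the model keeps requesting tools. A common way to harden it, sketched here under assumptions, is a handler table plus a step cap. The chat function is injected so the loop can be exercised offline with a stub; in real use you would pass a thin wrapper around ollama.chat. The plain-dict message shape used here is a simplification for illustration, not the SDK's response type.

```python
import json
from typing import Callable


def run_agent(chat: Callable[[list], dict], handlers: dict, messages: list,
              max_steps: int = 8) -> str:
    """Run a tool-calling loop until the model answers in text or max_steps is hit."""
    for _ in range(max_steps):
        reply = chat(messages)  # assumed shape: {"content": str, "tool_calls": [...]}
        messages.append({"role": "assistant", **reply})
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")
        for call in calls:
            name, args = call["name"], call["arguments"]
            handler = handlers.get(name)
            if handler is None:
                # Unknown tools become an error result the model can react to.
                result = json.dumps({"error": f"unknown tool: {name}"})
            else:
                result = handler(**args)
            messages.append({"role": "tool", "tool_name": name, "content": result})
    raise RuntimeError(f"agent did not finish within {max_steps} steps")


# Offline stub standing in for the model: one tool call, then a final answer.
_script = iter([
    {"content": "", "tool_calls": [
        {"name": "get_stock_price", "arguments": {"ticker": "AAPL"}}]},
    {"content": "AAPL is trading at 182.45."},
])
handlers = {
    "get_stock_price": lambda ticker: json.dumps({"ticker": ticker, "price": 182.45}),
}
history = [{"role": "user", "content": "Check the price of AAPL"}]
answer = run_agent(lambda _msgs: next(_script), handlers, history)
print(answer)
```

The cap turns a runaway agent into a visible error instead of a silent hang, which matters once the tools have side effects such as posting to Slack.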

Agent Skills on Android and iOS (On-Device, Zero Cloud)

Google launched Agent Skills in the AI Edge Gallery app (available on iOS and Android) — the first production application running multi-step agentic workflows entirely on-device using Gemma 4 E2B or E4B.

The framework processes 4,000 input tokens across 2 skill chains in under 3 seconds with GPU optimization. Skills are built as small tool definitions that plug into the LiteRT-LM runtime. Examples already shipping:

Wikipedia query skill: agent enriches its knowledge in real time without internet
Data visualization skill: converts user speech to charts and graphs
Multimedia skill: pairs photos with generated music
Workflow builder: complete multi-step app experiences through conversation

If you're building mobile AI, this is the path to agentic on-device experiences without cloud AI API costs or data leaving the user's phone.

When to Use Gemma 4 vs. Cloud APIs

Use Gemma 4 locally when:

Data cannot leave your infrastructure (healthcare, finance, legal, government)
You are building a commercial product and per-token API costs are a real P&L concern
Offline-first or air-gapped deployment is a requirement
You are prototyping and do not want to burn API credits during development
Your use case is latency-sensitive and round trips to a cloud API are too slow

Stick with cloud APIs (Claude, GPT) when:

You need best-in-class reasoning on genuinely hard, open-ended problems
Running complex multi-agent coordination with many concurrent agents
Your workload is sporadic and you prefer pay-per-use over running local inference hardware
You need capabilities Gemma 4 does not yet match: Claude's long document understanding, GPT's code interpreter

The honest answer for most AIBC builders: run Gemma 4 for high-volume or privacy-sensitive tasks, cloud APIs for complex reasoning and user-facing features where quality matters most. The 70/30 split between open and closed models is becoming standard practice.
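In code, that split amounts to a routing rule. Here is a minimal sketch, assuming each request carries a few flags you set upstream. The parameter names, the daily-volume threshold, and the backend labels are all illustrative choices rather than recommendations.

```python
def choose_backend(*, sensitive_data: bool = False, offline_required: bool = False,
                   needs_frontier_reasoning: bool = False, daily_volume: int = 0,
                   high_volume_threshold: int = 10_000) -> str:
    """Return "local" (self-hosted Gemma 4) or "cloud" (hosted API) for a task."""
    # Hard constraints first: data that cannot leave, or no network, forces local.
    if sensitive_data or offline_required:
        return "local"
    # Quality-critical work goes to the cloud when nothing forbids it.
    if needs_frontier_reasoning:
        return "cloud"
    # Otherwise route by cost: sustained high volume favors local hardware,
    # sporadic workloads favor pay-per-use.
    return "local" if daily_volume >= high_volume_threshold else "cloud"


print(choose_backend(sensitive_data=True, needs_frontier_reasoning=True))  # local
print(choose_backend(needs_frontier_reasoning=True))                       # cloud
print(choose_backend(daily_volume=50_000))                                 # local
```

The ordering encodes the priorities from the two lists above: compliance constraints override quality preferences, and cost only decides the cases where neither applies.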

Gemma 4 vs. Competing Open Models

Model             Active Params  Context  License                LMArena (rank or score)
────────────────────────────────────────────────────────────────────────────────────────
Gemma 4 31B       30.7B          256K     Apache 2.0             #3 open (1,452)
Gemma 4 26B MoE   3.8B active    256K     Apache 2.0             #6 open
Llama 4 Maverick  Large MoE      Long     Meta (700M MAU cap)    1,417 ELO
Mistral Small 4   6B active      Std      Apache 2.0             Below 31B
The Gemma 4 26B MoE is the standout for developer use: #6 globally among open models, activating only 3.8B parameters per token. It outperforms Gemma 3 27B on AIME 2026 by more than 4x (88.3% vs 20.8%) while activating roughly one-seventh as many parameters per token. If you have a modern gaming laptop or desktop, this is your model.

The Ecosystem Advantage

Over 400 million prior Gemma downloads have produced more than 100,000 community-created variants in the "Gemmaverse" — fine-tuned models, quantized versions, domain-specific adaptations, and integrations across every major AI framework.

Day-one ecosystem support for Gemma 4 includes: Ollama, LM Studio, vLLM, llama.cpp, MLX, Hugging Face Transformers, Transformers.js, NVIDIA NIM, NeMo, SGLang, Baseten, and Unsloth Studio for no-code fine-tuning.

For fine-tuning your own variant under Apache 2.0: Vertex AI supports freezing the vision/audio towers and fine-tuning only the language head. Unsloth Studio provides a local no-code UI.

What This Means

Three things are true simultaneously right now:

The open-source capability gap is closing fast. Gemma 4 31B sits at #3 globally among open models and #27 overall including GPT-5.4 and Claude. For the majority of business AI use cases, the quality gap is no longer meaningful enough to justify the privacy and cost tradeoffs of cloud APIs.

Apache 2.0 is becoming the enterprise open AI standard. Both Gemma 4 and Mistral Small 4 launched under Apache 2.0 in the same window. Meta's community license is looking increasingly like an outlier that will drive developers toward Google and Mistral alternatives.

Agentic capability is now a first-class training requirement. Gemma 4's native function calling, Extended Thinking, bounding box generation, and multimodal tool use signal that the prompt-engineering workaround era for open models is ending. Models are being trained to be agents from the ground up.

The builders who figure out on-device and self-hosted agentic use cases in the next six months will define a category that nobody has named yet.

AI Builder Club is where we work on those use cases together.

Join AIBC and share what you build with Gemma 4.

FAQ: Gemma 4 for Builders

Can Gemma 4 run fully offline with no internet? Yes. The E2B model runs in under 1.5GB RAM using quantization — it fits on a Raspberry Pi, any modern smartphone, and any developer laptop. All weights are available for download from Hugging Face and Ollama. No cloud calls required at inference time.

What is the difference between Gemma 4's function calling and older open models? Older models required prompt engineering tricks or grammar constraints to produce valid tool-use JSON. Gemma 4 was trained with native function calling as a core capability — the model understands tool schemas and produces valid structured output consistently without special prompting.

Can I use Gemma 4 in a commercial product without paying Google? Yes. Apache 2.0 grants full commercial rights with no usage caps, no monthly active user limits, and no special agreements required regardless of your product's scale.

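What GPU do I need to run the 26B MoE model? The 26B MoE activates only 3.8B parameters per token, which cuts compute, but all 25.2B weights still have to be loaded. Quantized to 4-bit, the weights alone take roughly 12.6GB, so plan for a 16GB card, or a smaller card with some layers offloaded to system RAM at reduced speed. For the 31B dense model, 4-bit weights take roughly 15.4GB, so you need 16-24GB VRAM (RTX 4090 or similar).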

How does Gemma 4's Extended Thinking mode work? Pass enable_thinking=True in your inference call. The model generates a chain-of-thought reasoning trace before producing its final response. This trace is included in the output and makes debugging agent failures significantly easier.
