How to run a local coding agent with Gemma 4 and Pi

#llm

I've been playing around with running coding agents fully locally. The setup I landed on is:

LM Studio + Pi agent + Gemma 4 26B A4B (Q4_K_M)

Gemma 4 running in LM Studio, connected to Pi as the terminal agent. It works surprisingly well, and this post walks through how to set it up.

Here's what we'll cover:

  1. Install LM Studio
  2. Download Gemma 4
  3. Start a local server
  4. Configure context size and GPU offload
  5. Install Pi
  6. Connect Pi to your local model
  7. Add skills
  8. Add extensions

1) Install LM Studio

You need something to serve the model locally. I'm using LM Studio here — it's a desktop app that handles model downloads, quantization, and exposes a local OpenAI-compatible API server. Download it from lmstudio.ai (macOS, Windows, Linux).

Ollama and llama-server (part of llama.cpp) work just as well if you prefer a CLI-first workflow. All three expose an OpenAI-compatible endpoint, so Pi doesn't care which one you use.

The rest of this guide uses LM Studio, but the Pi configuration works with any of them — just swap out the server configuration.
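
For reference, here's roughly what the CLI-first route looks like. This is a minimal sketch: the GGUF path and the Ollama model tag are placeholders, and the ports are each tool's defaults.

# llama.cpp: serve a local GGUF with an OpenAI-compatible API on port 8080
llama-server -m ./gemma-4-26b-a4b-Q4_K_M.gguf -c 65536 -ngl 99 --port 8080

# Ollama: pull the model, then point Pi at http://localhost:11434/v1
ollama pull gemma-4-26b-a4b   # placeholder tag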

2) Download Gemma 4

Gemma 4 is Google's latest open-weight model family, released under the Apache 2.0 license. Compared to earlier Gemma versions, it's a real step change for coding and agentic use cases — it now has native function calling, system prompt support, and thinking modes, which makes it a genuinely good model for local coding agents. The family includes four sizes:

Model Size        Architecture Type           Context Length
Gemma 4 E2B       Dense                       128K tokens
Gemma 4 E4B       Dense                       128K tokens
Gemma 4 26B A4B   Mixture of Experts (MoE)    256K tokens
Gemma 4 31B       Dense                       256K tokens

My recommendation: go with the 26B A4B. It's a Mixture-of-Experts model, which means it has 26B total parameters but only activates 4B per token. In practice, you get the quality of a much larger model with inference speeds closer to a small one. It handles text, image understanding, function calling, and thinking modes — which is exactly what you want for a coding agent.

That said, the E4B is surprisingly capable for its size. If you're short on VRAM, it's worth trying — but it does need more guidance and more specific prompts to get good results.

To download it, open LM Studio, search for gemma-4-26b-a4b, and download a quantized GGUF version (e.g., Q4_K_M). Choose the quantization based on your available VRAM:

Quantization   Download Size   Quality
Q4_K_M         18 GB           Good balance
Q6_K           24 GB           Higher quality
Q8_0           28 GB           Near-original

Note: Even though the model only activates 4B parameters per token, all 26B parameters still have to be loaded into memory, because the router can pick different experts for every token. That's why VRAM requirements are closer to those of a dense 26B model.
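
If you're not sure how much VRAM you have to work with, you can check from the terminal. The nvidia-smi query assumes an NVIDIA GPU; on Apple Silicon the GPU uses unified memory, so your total RAM is the budget.

# NVIDIA: total VRAM per GPU
nvidia-smi --query-gpu=memory.total --format=csv

# Apple Silicon: unified memory size in bytes
sysctl -n hw.memsize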

If you're on a Mac, you can also check out the MLX versions of Gemma 4. MLX is natively optimized for Apple Silicon and can be faster than the GGUF format on M-series chips.

3) Start the server in LM Studio

Once the model is downloaded:

  1. Go to the Developer tab in LM Studio
  2. Select your downloaded Gemma 4 model
  3. Click Start Server

The server runs at http://localhost:1234 by default and exposes an OpenAI-compatible API.

You can verify it's running:

curl http://localhost:1234/v1/models
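
To go one step further than listing models, send a quick test completion to the same endpoint. The model id below is an assumption: use whatever id shows up in the /v1/models response.

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'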

4) Configure context size and GPU offload

Before you start working, check the context size and GPU offload settings under Model Settings in the Developer tab.

Context size directly impacts VRAM usage. The model supports up to 256K tokens, but you probably don't need all of that for coding tasks. More context = more VRAM on top of the base model weights.

Use Case                    Context Size   Additional VRAM (approx.)
Small edits, single files   16K            ~1 GB
Standard coding sessions    64K            ~4 GB
Multi-file refactors        128K           ~8 GB
Full repo context           256K           ~16 GB

I'd recommend going with 128K if your VRAM allows it. Coding agents tend to accumulate a lot of context over a session — file contents, tool outputs, conversation history — and running out of context mid-task is annoying.

Pi has built-in session management that helps here. /compact summarizes older messages to free up context. /new starts a fresh session. /tree lets you navigate the session history and jump back to any previous point. /fork creates a new session from a past message, which is great when you want to branch off in a different direction without losing your history.

Also check the GPU Offload setting. This controls how many layers are loaded onto the GPU vs. kept in system RAM. More layers on GPU = faster inference, but requires more VRAM. If your GPU can't fit the entire model, LM Studio will split it between GPU and CPU — it'll still work, just slower for the CPU portion. I keep this at maximum (30 for the 26B A4B).

If you're running into out-of-memory issues, lower the context size first.

5) Install Pi

Pi is a minimal terminal coding harness by Mario Zechner. The core is deliberately small — the model gets four tools (read, write, edit, bash) and that's it.

You can customize it with extensions, skills, prompt templates, and themes. It's also token efficient and the system prompt is small, so you can do actual context engineering. That matters a lot when you're running a local model.

npm install -g @mariozechner/pi-coding-agent

6) Configure the local model in Pi

Create (or edit) the file ~/.pi/agent/models.json to point Pi at your local LM Studio server:

{
  "providers": {
    "lmstudio": {
      "baseUrl": "http://localhost:1234/v1",
      "api": "openai-completions",
      "apiKey": "lm-studio",
      "models": [
        {
          "id": "google/gemma-4-26b-a4b",
          "input": ["text", "image"]
        }
      ]
    }
  }
}

Note: Set the model id to match the exact model name shown in LM Studio's server tab.
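
If you're not sure what the exact id is, you can also read it straight off the API (assuming you have jq installed):

curl -s http://localhost:1234/v1/models | jq -r '.data[].id'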

Then launch Pi and select your local model:

pi
# Use /model to switch to your local LM Studio model

That's it. You now have a local coding agent running entirely on your machine.

7) Skills

Skills are on-demand capability packages that extend what Pi can do. They follow the Agent Skills standard and are just Markdown files with instructions.

Install community skills via git:

# User-level (available in all projects)
git clone https://github.com/badlogic/pi-skills ~/.pi/agent/skills/pi-skills

# Or project-level
git clone https://github.com/badlogic/pi-skills .pi/skills/pi-skills

Some skills I find useful:

  • liteparse: Fast local document parsing (PDFs, DOCX, PPTX and more). Especially handy with Gemma, since the model only takes text and image input: liteparse converts documents into a format the model can actually work with.
  • frontend-slides: Create presentation slides in HTML.
  • pi-skills: A collection of skills for pi-coding-agent.
  • grill-me: Get grilled to work out and iterate on an idea.
  • gemini-skills: Skills for the Gemini API, SDK and model interactions.

Invoke a skill during a session with /skill:name, or let the agent discover and use them automatically.
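
If you want to write your own, a skill is just a folder containing a SKILL.md with a short frontmatter block and instructions. Here's a minimal sketch; the layout follows the Agent Skills convention, and the changelog skill itself is a made-up example.

# Project-level skill: .pi/skills/<name>/SKILL.md
mkdir -p .pi/skills/changelog
cat > .pi/skills/changelog/SKILL.md <<'EOF'
---
name: changelog
description: Summarize recent git history into a CHANGELOG entry.
---

When asked for a changelog entry:
1. Run git log --oneline -20 to see recent commits.
2. Group commits into features, fixes, and chores.
3. Append a dated entry to CHANGELOG.md, following the existing format.
EOF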

8) Extensions

Extensions are TypeScript modules that go deeper — custom tools, commands, UI components, permission gates, even sub-agents.

One thing to know: Pi runs YOLO by default. It will execute bash commands without asking. That's fast but can be risky, especially with a local model that might hallucinate a destructive command. The permission-gate extension helps — it prompts you for confirmation before running potentially dangerous commands. It's not a full security sandbox though. If you want something more robust, check out cco (runs commands in a container) or the sandbox extension.

# Load an extension with --extension flag
pi --extension examples/extensions/permission-gate.ts

# Or copy to extensions directory for auto-discovery
cp permission-gate.ts ~/.pi/agent/extensions/

That's the full setup. Once it all clicks, it's a surprisingly capable workflow — and it's nice knowing everything runs on your own hardware. Happy local building!

Acknowledgements

Thanks to my colleague Ian who helped me find a great setup. He also created a similar video guide showing how to set up Gemma 4 with LM Studio & OpenCode.