You don't need a GPU‑heavy machine to work with large language models like Nemotron 3 Nano, GLM 4.7, and Kimi‑K2. With Ollama acting as a local gateway and the Claude Code CLI as your interface, you can talk to cloud‑hosted open models while your laptop stays fast, cool, and responsive. Your computer becomes the control panel. The cloud does the heavy lifting.

![Ollama and Claude CLI](/content/uploads/2026/02/Cursor_and___Base_Claude-over-Ollama_wrapper_%E2%80%A2_Untitled-1_%E2%80%94__zsh_d.png)

***

## What This Ollama and Claude CLI Setup Actually Does

Your laptop → Claude CLI → Ollama → Cloud inference

The models are **not running on your machine**. They are hosted remotely, and Ollama forwards your requests to them. This means:

- No GPU upgrade required
- No RAM overload
- No laptop overheating
- Access to large, capable models anyway

***

### 1. Install Ollama

Download and install Ollama from the official website.

Verify the installation:

```bash
ollama --version
```

***

### 2. Confirm You Can Run Cloud Models

Ollama supports cloud‑hosted variants of some open models. Many Ollama cloud models use `:cloud` or `-cloud` style suffixes, but you should always use the exact model name shown in Ollama's library:

```bash
ollama run <model-name>:cloud
```

Examples:

```bash
ollama run glm-4.7:cloud
ollama run kimi-k2.5:cloud
ollama run nemotron-3-nano:30b-cloud
ollama run devstral-small-2:24b-cloud
ollama run minimax-m2.1:cloud
ollama run gemini-3-pro-preview:latest
ollama run gemini-3-flash-preview:latest
ollama run gpt-oss:120b-cloud
ollama run gpt-oss:20b-cloud
```

If the model starts without downloading gigabytes locally, you're using cloud inference correctly.

***

### 3. Install Claude Code CLI

You'll use the Claude Code CLI as your main interface for chatting with and coding alongside these models.

#### Install via npm

```bash
npm install -g @anthropic-ai/claude-code
```

Verify the installation:

```bash
claude --version
```

You do **not** need to log in with a paid Anthropic key for this setup, since we'll route requests through Ollama instead. Once installed, you'll be able to point Claude at Ollama rather than Anthropic's hosted models.

***

### 4. Create a Clean Ollama Wrapper for Claude

Instead of repeating environment variables every time, define a helper alias or function in your shell config (`.zshrc` or `.bashrc`).

#### Base wrapper

```bash
alias claude-ollama='ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude'
```

This tells the Claude CLI to:

- Send requests to Ollama
- Use Ollama as the "auth provider"
- Skip real Anthropic API keys

***

### 5. Create Model Shortcuts (Much Cleaner)

We use aliases instead of changing Claude's global configuration so your **default Claude setup stays untouched**. This lets you switch between standard Claude usage and Ollama‑routed cloud models without breaking anything or rewriting config files.

Now you can make readable aliases for each cloud model:

```bash
alias glm='claude-ollama --model glm-4.7:cloud'
alias kimi-k2='claude-ollama --model kimi-k2.5:cloud'
alias nemo='claude-ollama --model nemotron-3-nano:30b-cloud'
```

Run them in the terminal:

```bash
glm
kimi-k2
nemo
```

And the Claude CLI will talk to those cloud‑hosted models through Ollama.

![GLM 4.7 on Claude CLI](/content/uploads/2026/02/glm-carmelyne.png)

***

### 6. These Cloud Models Can Use Tools Too

Because you're accessing them through the Claude Code CLI, these cloud‑inferenced models aren't limited to plain chat.
They can use the same developer tools Claude normally can, such as:

- Reading and writing files
- Running shell commands like `grep`
- Performing web searches (when enabled in your CLI setup)

So even though the model is running remotely, it can still act locally through Claude's tool layer. Your laptop becomes the execution environment, while the cloud model provides the intelligence.

This is what makes the setup powerful: lightweight hardware, full developer‑agent capabilities.

***

### 7. What You Just Built

You now have a hybrid AI setup:

- Local machine → lightweight routing
- Cloud GPUs → heavy reasoning

Your laptop stays fast and quiet while you still get access to large‑scale models.

This is especially useful if you:

- Use a MacBook Air or older laptop
- Work while traveling
- Don't want to maintain a local GPU setup

***

## Free Tier and Paid Limits

Ollama offers a Free tier, with paid plans such as Pro at $20/month and Max at $100/month if you need higher limits. You're essentially renting bursts of supercomputer time only when you need it.

***

## Cloud Model Reference Table
| Model Name | Parameters | Core Focus / Capabilities |
| --- | --- | --- |
| glm-5:cloud | 744B (40B active) | Advanced reasoning, systems engineering |
| qwen3-vl:cloud | 235B+ | State-of-the-art vision & thinking |
| devstral-2:cloud | 123B | Codebase exploration, multi-file editing |
| nemotron-3-super:cloud | 120B (MoE) | NVIDIA; complex multi-agent apps |
| qwen3.5:cloud | 122B (Max) | Multimodal, vision, thinking, tools |
| qwen3-next:cloud | 80B | High parameter efficiency & speed |
| nemotron-3-nano:cloud | 30B | Optimized agentic workflows |
| devstral-small-2:cloud | 24B | Specialized software engineering agents |
| kimi-k2.5:cloud | N/A | Multimodal agentic, instant/thinking modes |
| minimax-m2.5:cloud | N/A | Productivity, coding, high-efficiency |
| gemini-3-flash:preview | N/A | Frontier speed & multimodal reasoning |
| deepseek-v3.2:cloud | N/A | High efficiency, agent performance |
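To check which of the models above are actually available to your install, you can query the Ollama daemon's local REST API rather than scanning `ollama list` output by eye. The sketch below filters the `/api/tags` response for cloud variants; the JSON sample at the bottom is illustrative, not real output from your machine:

```shell
# Extract model names from Ollama's /api/tags JSON response,
# keeping only the cloud variants.
list_cloud_models() {
  grep -o '"name":"[^"]*"' | sed 's/"name":"//; s/"$//' | grep cloud
}

# Live usage (requires a running Ollama daemon on the default port):
#   curl -s http://localhost:11434/api/tags | list_cloud_models

# Illustrative sample of the JSON shape the endpoint returns:
sample='{"models":[{"name":"glm-4.7:cloud"},{"name":"llama3.2:3b"},{"name":"gpt-oss:120b-cloud"}]}'
printf '%s' "$sample" | list_cloud_models
```

For anything beyond a quick check, a proper JSON tool like `jq` is a better fit than `grep`, but the pipeline above works with nothing installed beyond a POSIX shell.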
***

### Summary

- **Ollama** handles routing.
- **Claude CLI** gives you a powerful developer interface.
- The **cloud** runs the big models.

Your laptop just orchestrates everything. And that's a very efficient way to work with modern LLMs.

***

## Quick Reference: Can I...? / Do I need...?

*A tactical checklist for scaling your inference without scaling your hardware.*

- **Can I use Ollama cloud without a GPU?** Yes. The model inference runs in Ollama's cloud instead of on your laptop.
- **Can I use Claude Code with Ollama cloud?** Yes. Ollama supports Claude Code through an Anthropic-compatible API.
- **Do I need to pull models locally first?** Usually, yes. Pulling a supported cloud model makes it available to your Ollama install.
- **Can I disable cloud later?** Yes. Use `OLLAMA_NO_CLOUD=1` or disable cloud in `server.json` for a local-only setup.
- **What does Ollama say about data retention?** Ollama says its cloud does not retain your data.

**Summary:** This setup lets a lightweight laptop act as a practical control surface for much larger cloud-hosted models.
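If you prefer a single helper that takes the model name as an argument (functions handle arguments more cleanly than aliases), the per-model aliases from earlier can collapse into one shell function. A minimal sketch — the name `claude_ollama` is just an illustrative choice, and the environment variables are the same ones the base wrapper sets:

```shell
# Run Claude Code against any Ollama cloud model, passing extra args through.
# Uses the same environment variables as the alias-based wrapper above.
claude_ollama() {
  model="$1"; shift
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_API_KEY="" \
  claude --model "$model" "$@"
}

# Usage:
#   claude_ollama glm-4.7:cloud
#   claude_ollama kimi-k2.5:cloud
```

Drop it in your `.zshrc` or `.bashrc` alongside the aliases, and new cloud models cost you nothing to try: no new alias needed, just a different first argument.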