Izzo: Local AI just got a lot more practical on Macs.
Izzo: You're listening to Exploring Next, episode two-sixty-three. I'm Izzo, here with Boone, and today we're talking about Ollama's new MLX support.
Boone: Perfect timing, too, with OpenClaw hitting three hundred thousand GitHub stars and everyone suddenly wanting to run models locally.
Izzo: Right, and I think people are finally hitting that wall with cloud costs. When you're paying Claude or ChatGPT subscription fees and still running into rate limits—
Boone: —you start looking at that M3 MacBook and wondering if it can actually run something decent locally.
Izzo: Exactly. So Boone, break down what MLX actually is and why it matters for this.
Boone: MLX is Apple's machine learning framework that's designed specifically for their unified memory architecture. Unlike traditional setups where you have separate CPU and GPU memory—
Izzo: —with all that copying data back and forth—
Boone: Exactly. Apple Silicon shares memory between CPU and GPU. MLX optimizes for that. So instead of hitting GPU memory limits, you're using the full system RAM.
Izzo: And that's huge, because most people don't have gaming rigs with massive VRAM. But a MacBook Pro with thirty-two gigs? That's getting common.
Boone: Plus, Ollama 0.19 adds support for Nvidia's NVFP4 4-bit format and improves caching. They're attacking the memory problem from multiple angles.
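To see why low-bit formats and a thirty-two-gig floor go together, here is a back-of-envelope Python sketch. It estimates how much memory the weights alone of a thirty-five-billion-parameter model need at different precisions. It assumes a flat bits-per-weight figure and ignores the KV cache, activations, runtime overhead, and the per-block scale factors that 4-bit formats such as NVFP4 store alongside the weights. The episode doesn't say which precision the MLX build of Qwen3.5 uses, and the helper `weight_gb` is a hypothetical name.

```python
def weight_gb(params: int, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes (1e9 bytes).

    Deliberately ignores the KV cache, activations, runtime overhead and any
    quantization scale factors, so real memory use will be higher.
    """
    return params * bits_per_weight / 8 / 1e9


PARAMS = 35_000_000_000  # "thirty-five billion parameters", as quoted in the episode
RAM_32_GIB_IN_GB = 32 * 1024**3 / 1e9  # a "32 GB" Mac has 32 GiB, about 34.4 decimal GB

for label, bits in [("16-bit (FP16/BF16)", 16), ("8-bit", 8), ("4-bit (e.g. NVFP4 values)", 4)]:
    size = weight_gb(PARAMS, bits)
    verdict = "fits" if size < RAM_32_GIB_IN_GB else "does not fit"
    print(f"{label}: ~{size:.1f} GB of weights, {verdict} in 32 GiB before overhead")
```

On these rough numbers, 16-bit weights (about 70 GB) are far too big, 8-bit weights (about 35 GB) just miss, and 4-bit weights (about 17.5 GB) leave headroom for the KV cache and the OS.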
Izzo: Okay but let's get specific. What can you actually run right now?
Boone: Right now, just Qwen3.5, the thirty-five-billion-parameter model. You need Apple Silicon and at least thirty-two gigs of RAM.
Izzo: So we're talking serious hardware requirements. That's not your base-model MacBook Air.
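The hardware check the hosts describe (Apple Silicon plus at least thirty-two gigs of RAM) can be scripted. This is a minimal Python sketch. It assumes that `platform.machine()` returns `arm64` on Apple Silicon and reads total memory from macOS's `sysctl -n hw.memsize`. The helper names are hypothetical. The 32 GiB threshold comes from the episode and is not read from Ollama.

```python
import platform
import subprocess

MIN_RAM_GIB = 32  # minimum quoted in the episode for the MLX preview


def is_apple_silicon(system: str, machine: str) -> bool:
    """Apple Silicon Macs report 'Darwin' and 'arm64'.

    Caveat: an x86_64 Python running under Rosetta reports 'x86_64' even on Apple Silicon.
    """
    return system == "Darwin" and machine == "arm64"


def total_ram_bytes() -> int:
    """Total physical memory on macOS, read from `sysctl -n hw.memsize` (in bytes)."""
    result = subprocess.run(
        ["sysctl", "-n", "hw.memsize"], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


def meets_requirements(system: str, machine: str, ram_bytes: int) -> bool:
    """True if the machine is Apple Silicon with at least MIN_RAM_GIB of RAM."""
    return is_apple_silicon(system, machine) and ram_bytes >= MIN_RAM_GIB * 1024**3


if __name__ == "__main__":
    system, machine = platform.system(), platform.machine()
    if not is_apple_silicon(system, machine):
        print(f"{system}/{machine}: not Apple Silicon, so the MLX preview doesn't apply.")
    else:
        ram = total_ram_bytes()
        ok = meets_requirements(system, machine, ram)
        print(
            f"Apple Silicon with {ram / 1024**3:.0f} GiB RAM: "
            f"{'meets' if ok else 'is below'} the {MIN_RAM_GIB} GiB minimum."
        )
```

Keeping the check as a pure function of its inputs makes it easy to test on any machine. Only the `__main__` branch touches the real system.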
Boone: No, but here's what's interesting: if you have one of the new M5-series chips, you get access to the Neural Accelerators, which means better tokens per second and a faster time to first token.
Izzo: Time to first token is huge for user experience. Nobody wants to wait ten seconds for the model to even start responding.
Boone: And this is where unified memory really shines. With traditional GPU setups, you're constantly moving data. With MLX, the model lives in a shared memory space.
Izzo: From a product perspective, this feels like the moment local models become viable for more than just hobbyists.
Boone: I mean, we're still not talking frontier-model quality, but it's good enough for code completion, document analysis, and basic reasoning tasks.
Izzo: Right, and privacy is becoming a real selling point, especially for teams working on sensitive code or proprietary documents.
Boone: Though I have to say—and this is important—they specifically warn against OpenClaw-style setups that give models deep system access.
Izzo: Good point. Local doesn't automatically mean safe if you're giving it shell access and file system permissions.
Boone: The architecture choices here are really smart though. Instead of fighting against Apple's design decisions, they're leaning into them.
Izzo: What do you mean?
Boone: Most ML frameworks were built for discrete GPUs. Ollama with MLX says 'okay, unified memory is actually an advantage if we design for it.'
Izzo: That's clever. Work with the hardware, not against it.
Boone: Exactly. And I'm curious how this scales as they add more models. Qwen3.5 is just the start.
Izzo: Speaking of scaling, what should people actually try if they want to experiment with this?
Boone: First, check if you have the hardware: an Apple Silicon Mac with thirty-two gigs minimum. Then install Ollama and try the preview build.
Izzo: Command-line tool, right? That's still the main barrier for less technical users.
Boone: Yeah, though there are GUI wrappers popping up. But honestly, 'ollama run qwen3.5:35b' isn't that scary.
Izzo: Fair point. What else should people research?
Boone: Look into MLX itself, Apple