Gemini 2.5 Pro: Great for Vibe Coding, Weak for Real Dev Work
After shipping agentic workflows and CLI tools for months, I stress-tested Gemini 2.5 Pro on actual dev work. Here’s where it cracked under pressure.
1. The Experiment Setup
I wanted to see if Gemini[1] could handle a real developer workflow, not just greenfield snippets or brainstorming sessions. The test case: a content management system with a broken homepage flow that needed surgical fixes across multiple files.
I gave it:
- The relevant source files for the broken flow
- An explicit map of the data flow between them
- A step-by-step plan of twelve fix tasks
Expectation: handle multi-file context, track changes across edits, stick to the plan, and execute like a competent coding assistant. This wasn’t asking for architectural decisions—just methodical bug fixing.

2. Where Gemini 2.5 Pro Broke Down
Tunnel Vision: Even when I explicitly mapped out the data flow (Utils fetches content → Controller processes → template renders), it failed to grasp how changes in one file affected others. It would fix the Utils function but ignore that the Controller was still calling the old method signature.
Dependency Blindness: Never verified that includes were actually included or that method calls matched updated signatures. It assumed everything “just worked” without checking the connections between files; a concrete sketch of both failures follows this list.
Context Drift[2]: Around task #8 of 12, something shifted. It started acting like it had hit a memory wall, repeating errors it had already diagnosed and solved, losing track of what we’d already fixed. The vaunted long context window felt more like a leaky bucket.
Regression Loops: This was the killer. Fix homepage content loading → detail pages break. Fix detail pages → homepage breaks again. We spent three hours in this cycle, with Gemini apologizing each time but unable to hold both fixes simultaneously in its working memory.
Empty Promises on Context: Google markets this model’s massive context window, but in practice context felt like it was constantly leaking. I’ve worked with other LLMs that genuinely hold multi-file state; Gemini wasn’t demonstrating that capability.
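To make the tunnel vision and dependency blindness concrete, here’s a minimal sketch of the shape of the problem. Utils::fetchContent is from my actual codebase; the controller class, file names, and template wiring are hypothetical stand-ins:

// utils.php: after the "fix", fetchContent takes a page key (real method, simplified here)
class Utils {
    public static function fetchContent(string $page): string {
        return file_get_contents(__DIR__ . "/content/{$page}.html");
    }
}

// home_controller.php: what Gemini kept leaving behind (hypothetical names)
// Failure 1: utils.php is never required, so PHP dies with "Class 'Utils' not found".
// Failure 2: even with the require added, the call below still uses the old
// zero-argument signature and throws an ArgumentCountError at runtime.
class HomeController {
    public function index(): string {
        $content = Utils::fetchContent(); // stale call site it never re-checked
        return str_replace('{{content}}', $content, file_get_contents(__DIR__ . '/templates/index.html'));
    }
}

Neither failure is subtle; both would surface on the first page load. The point is that the model fixed the definition and never looked back at the call sites.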
3. The “Spoon-Feeding” Experiment
Thinking maybe I was being too abstract, I got extremely specific: I spelled out the exact changes I wanted, down to the specific function calls to use.
Even with this level of hand-holding, execution remained brittle:
// What I needed: simple homepage content load
$content = Utils::fetchContent('index');

// What it kept producing: over-engineered loops that broke other pages
$articles = Utils::fetchAllContentMetadata();
foreach ($articles as $article) {
    // Complex filtering logic that wasn't needed
    // and broke the detail page routing
}
The pattern was consistent: it couldn’t make surgical changes without introducing side effects elsewhere. No ability to cleanly walk back problematic changes or resume work after restarting the conversation.
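The only defense I found was re-checking both page types by hand after every single change. A throwaway script like this (the local URLs and routes are assumptions about my dev setup, not something from the session) is what the workflow degenerated into:

// regression_check.php: crude smoke test run after every edit
// to catch the homepage/detail-page regression loop early.
$pages = [
    'homepage'    => 'http://localhost:8000/',
    'detail page' => 'http://localhost:8000/article/example-post', // hypothetical route
];

foreach ($pages as $label => $url) {
    $html = @file_get_contents($url);
    if ($html === false || trim($html) === '') {
        echo "FAIL: {$label} ({$url}) did not render\n";
    } else {
        echo "OK:   {$label}\n";
    }
}

When you’re writing smoke tests to babysit your coding assistant, the assistant has stopped saving you time.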
4. The Reality Check
Frustrated, I took the exact same problem statement to another model. The difference was stark.
That interaction felt like collaborating with a senior developer who gets the bigger picture. The Gemini experience felt like managing a well-meaning intern who keeps breaking things while trying to help.
5. What Would Bring Me Back
For Gemini to earn a spot in my production workflow, it would need to demonstrate:
Genuine Long Context: Not just a large token window, but actual retention of decisions and dependencies across a multi-hour coding session.
Surgical Precision: The ability to make targeted changes without regressing unrelated functionality. Real codebases have interconnected parts, and an AI needs to respect those connections.
Decision Persistence: When we agree on an approach or architectural decision, it should stick to that through the entire session, not drift into different patterns halfway through.
Rollback Intelligence: When something breaks, it should be able to identify what changed and cleanly revert without losing other progress.
PRD Adherence[3]: Following a structured plan without getting distracted by tangential improvements or abandoning the original scope.
6. The Bottom Line
Gemini 2.5 Pro[4] has a place in the developer toolkit; it’s genuinely good for exploration, quick prototyping, and “what if we tried this approach” conversations. The creative coding vibes are solid.
But for structured, dependency-heavy development work where precision matters? It’s not ready. Too much context drift, too many regression loops, too much babysitting required.
I’ll happily revisit when future updates address these core execution issues. The potential is clearly there. But today, when I need to ship reliable fixes to production code, I’m reaching for tools that can actually hold the thread.
[1] Gemini: https://gemini.google.com/app
[2] Concept Drift: https://en.wikipedia.org/wiki/Concept_drift
[3] Notion’s How to Write a PRD: https://www.notion.com/blog/how-to-write-a-prd
[4] Gemini 2.5 Pro: https://deepmind.google/models/gemini/pro/