Is Gemini 3.5 Flash Actually Better at Coding Than 3.1 Pro? …
DEV Community: rust (hiyoyo)
Background
Gemini 3.5 Flash launched at Google I/O 2026 with a bold claim: it beats Gemini 3.1 Pro on coding and agentic benchmarks — while running 4x faster.
At the same time, X (formerly Twitter) is full of posts saying it hallucinates constantly and doesn't even reach Claude Sonnet level.
So which is it? I ran a real benchmark using code from my actual dev stack to find out.
Who I am
- Solo indie Mac app developer (Tauri + Rust + Swift stack)
- I use Gemini daily as part of my coding workflow
- Built 13 macOS utilities, mostly Android connectivity tools
The Test
Models compared
- Gemini 3.1 Pro
- Gemini 3.5 Flash (new)
What I tested
I gave both models a ~200-line Rust file (ADB device manager) with 14 intentional bugs and asked them to find and fix everything.
Why 200 lines? Because in my experience:
- Under 50 lines: any model gets lucky sometimes
- Over 100 lines: older Flash models produce near-unusable code
- 200 lines: a realistic production task that separates real understanding from pattern matching
Bug breakdown
The ADB-specific bugs were key — you need domain knowledge to catch them, not just Rust syntax awareness.
Prompt (no hints)
The following Rust code contains several bugs.
Please identify all bugs and provide the corrected code.
Include an explanation for each bug.
No hints. No scaffolding. Raw capability test.
The Code (buggy version)
adb_device_manager.rs
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};
use std::sync::{Arc, Mutex};
use tokio::time::sleep;
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct AdbDevice {
    pub serial: String,
    pub state: DeviceState,
    // assumed: String keys and values
    pub properties: HashMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum DeviceState {
    Online, Offline, Unauthorized, Unknown,
}

pub struct AdbManager {
    devices: Arc<Mutex<Vec<AdbDevice>>>,
    adb_path: String,
    command_timeout: Duration,
}

impl AdbManager {
    pub fn new(adb_path: String) -> Self {
        AdbManager {
            devices: Arc::new(Mutex::new(Vec::new())),
            adb_path,
            // BUG 1: from_millis(5), should be from_secs(5)
            command_timeout: Duration::from_millis(5),
        }
    }

    pub fn execute_command(&self, serial: &str, args: &[&str]) -> Result<String, String> {
        let start = Instant::now();
        // prepend "-s <serial>" before command args
        let mut cmd_args = vec!["-s", serial];
        cmd_args.extend_from_slice(args);
        let output = Command::new(&self.adb_path)
            .args(&cmd_args)
            .output()
            .map_err(|e| format!("Command failed: {}", e))?;
        // BUG 4: timeout check AFTER command completes, so it can never interrupt a hung command
        if start.elapsed() > self.command_timeout {
            return Err("timed out".to_string());
        }
        Ok(String::from_utf8_lossy(&output.stdout).to_string())
    }

    pub async fn wait_for_device(&self, serial: &str, timeout_secs: u64) -> Result<(), String> {
        let deadline = Instant::now() + Duration::from_secs(timeout_secs);
        loop {
            // BUG 8: no sleep → CPU at 100%
            let devices = self.get_connected_devices()?;
            if devices.iter().any(|d| d.serial == serial) {
                return Ok(());
            }
            if Instant::now() >= deadline {
                return Err("timeout".to_string());
            }
        }
    }

    pub fn install_apk(&self, serial: &str, apk_path: &str) -> Result<(), String> {
        let result = self.execute_command(serial, &["install", "-r", apk_path])?;
        // BUG 11: adb install returns exit code 0 even on failure
        // must check stdout for "Success"/"Failure" strings
        if result.contains("Failure") {
            return Err(format!("Install failed: {}", result));
        }
        Ok(())
    }

    pub fn take_screenshot(&self, serial: &str, save_path: &str) -> Result<(), String> {
        let temp_path = "/sdcard/screenshot_temp.png";
        self.execute_command(serial, &["shell", "screencap", "-p", temp_path])?;
        // BUG 12: temp file never deleted from device
        self.execute_command(serial, &["pull", temp_path, save_path])?;
        Ok(())
    }
}
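Two of the bugs flagged in the comments above (BUG 8 and BUG 11) have fixes that are easy to get subtly wrong, so here is a minimal sketch of each using only the standard library. The helper names `poll_until` and `check_install_output` are hypothetical and not part of the original file, and the "Success"/"Failure" convention is taken from the BUG 11 comment rather than checked against every adb version.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper (not in the original file): polls `check` until it
/// returns true or `timeout` elapses, sleeping `interval` between attempts
/// so the loop does not pin a CPU core (the BUG 8 pattern).
pub fn poll_until<F>(mut check: F, timeout: Duration, interval: Duration) -> Result<(), String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err("timeout".to_string());
        }
        sleep(interval);
    }
}

/// Hypothetical helper: treats an install as successful only when the output
/// contains "Success", following the convention in the BUG 11 comment.
/// Anything else, including empty output, is reported as an error.
pub fn check_install_output(output: &str) -> Result<(), String> {
    if output.contains("Success") {
        Ok(())
    } else {
        Err(format!("Install failed: {}", output.trim()))
    }
}
```

Inside an async function such as `wait_for_device`, the blocking `sleep` would be swapped for `tokio::time::sleep(interval).await` so the runtime thread is not held up.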
Results
Bug detection
Both models found every bug. Accuracy: identical.
Speed
This was the most striking difference by far.
Where the models diverged
Same score, but different approaches on a few interesting bugs:
Bug 4: timeout check after execution
- 3.1 Pro rewrote it using tokio::time::timeout (fully async)
- 3.5 Flash used a spawn() + try_wait() polling loop (sync-leaning approach)
Both are valid fixes. Different style choices.
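To make the sync-leaning approach concrete, here is a minimal sketch of a `spawn()` + `try_wait()` style timeout built on the standard library alone. It is an illustration of the technique, not either model's actual output, and `wait_with_timeout` is a hypothetical name.

```rust
use std::process::{Child, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: waits for an already-spawned `child` for at most
/// `timeout`, polling `try_wait()`. If the deadline passes, the process is
/// killed and reaped so it does not linger as a zombie.
pub fn wait_with_timeout(mut child: Child, timeout: Duration) -> Result<ExitStatus, String> {
    let deadline = Instant::now() + timeout;
    loop {
        match child.try_wait() {
            // The process has exited; hand back its status.
            Ok(Some(status)) => return Ok(status),
            // Still running: give up if the deadline has passed, else nap briefly.
            Ok(None) => {
                if Instant::now() >= deadline {
                    let _ = child.kill();
                    let _ = child.wait();
                    return Err("timed out".to_string());
                }
                sleep(Duration::from_millis(10));
            }
            Err(e) => return Err(format!("wait failed: {}", e)),
        }
    }
}
```

Unlike the buggy version, the deadline is checked while the command is still running, so a hung process is actually interrupted. This sketch returns only the exit status; capturing stdout as well would need the pipe to be drained while waiting (for example on a separate thread), otherwise a chatty child can block on a full pipe.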
Bug 10: Mutex poison handling
- 3.1 Pro: into_inner() to safely recover the data
- 3.5 Flash: expect() for fail-fast behavior
Opposite design philosophies. Neither is wrong; it depends on your error handling strategy.
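The two philosophies look like this side by side. This is a minimal sketch, assuming the shared state is a `Mutex<Vec<String>>` of serials; the function names are hypothetical and the code is not either model's actual output.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Recovering style: if another thread panicked while holding the lock,
/// pull the guard out of the PoisonError and carry on with the data.
pub fn push_recovering(devices: &Mutex<Vec<String>>, serial: &str) {
    let mut guard = devices
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.push(serial.to_string());
}

/// Fail-fast style: a poisoned lock is treated as a bug, so this panics.
pub fn push_fail_fast(devices: &Mutex<Vec<String>>, serial: &str) {
    let mut guard = devices.lock().expect("device list mutex poisoned");
    guard.push(serial.to_string());
}

/// Demo helper: poisons the mutex by panicking in a thread that holds the lock.
pub fn poison(devices: &Arc<Mutex<Vec<String>>>) {
    let shared = Arc::clone(devices);
    let _ = thread::spawn(move || {
        let _guard = shared.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
}
```

After `poison(&devices)`, `push_recovering` still appends to the list, while `push_fail_fast` panics. Recovering keeps the app alive at the risk of working with half-updated data; failing fast surfaces the problem immediately.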
Bug 6: spaces in remote path
- 3.1 Pro: correctly noted that std::process::Command passes each argument to the child process directly, with no shell splitting, so no quoting is needed. It left the code as-is (accurate ADB knowledge)
- 3.5 Flash: added format!("\"{}\"", remote_path) quoting (technically unnecessary, slight overreach)
3.1 Pro showed deeper understanding of how ADB + Rust process spawning actually works.
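The no-shell-splitting behavior can be checked without running adb at all, by building a `Command` and inspecting its argument list with `get_args()`. This is a minimal sketch: `build_pull_command` and `args_of` are hypothetical helpers, and the `adb` binary name is an assumption for illustration.

```rust
use std::process::Command;

/// Hypothetical helper: builds (but does not run) an `adb pull` command.
/// The "adb" binary name is assumed for illustration.
pub fn build_pull_command(serial: &str, remote_path: &str, local_path: &str) -> Command {
    let mut cmd = Command::new("adb");
    cmd.args(&["-s", serial, "pull", remote_path, local_path]);
    cmd
}

/// Returns the arguments exactly as they would be handed to the child process.
pub fn args_of(cmd: &Command) -> Vec<String> {
    cmd.get_args()
        .map(|arg| arg.to_string_lossy().into_owned())
        .collect()
}
```

A remote path such as `/sdcard/My Photos/a b.png` stays a single argument in that list, spaces included, so wrapping it in literal quotes would only add quote characters to the path. Commands routed through `adb shell` are a separate case, because the device-side shell parses the command line again.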
Pricing reality check
App (free plan)
API (Pay-as-you-go)
Straight from Google AI Studio's official UI:
The $9.00 output price is 3x the previous generation (Gemini 3 Flash at $3.00). Google's "half the price of frontier models" pitch compares against competitors — not their own previous Flash tier.
For indie developers:
- Prototyping / testing → Free tier is more than enough
- Production / commercial → $1.50 input / $9.00 output per million tokens. Budget carefully for output-heavy workloads.
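For budgeting, the arithmetic is simple enough to script. This is a rough sketch using the $1.50/$9.00 per-million-token prices quoted in this article; the `Pricing` type and `estimate_cost_usd` function are hypothetical names, and it ignores anything beyond plain input and output token counts.

```rust
/// Prices in USD per one million tokens.
#[derive(Debug, Clone, Copy)]
pub struct Pricing {
    pub input_per_million: f64,
    pub output_per_million: f64,
}

/// Gemini 3.5 Flash pay-as-you-go prices as quoted in this article.
pub const GEMINI_3_5_FLASH: Pricing = Pricing {
    input_per_million: 1.50,
    output_per_million: 9.00,
};

/// Estimated cost in USD for a workload with the given token counts.
pub fn estimate_cost_usd(pricing: Pricing, input_tokens: u64, output_tokens: u64) -> f64 {
    (input_tokens as f64 / 1_000_000.0) * pricing.input_per_million
        + (output_tokens as f64 / 1_000_000.0) * pricing.output_per_million
}
```

For example, a job that sends 2M input tokens and gets 1M output tokens back comes to 2 × $1.50 + 1 × $9.00 = $12.00, with output accounting for three quarters of the bill.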
Bonus: I asked Gemini about its own price. It hallucinated.
During testing I asked Gemini 3.5 Flash directly: "What's the API pricing for Gemini 3.5 Flash?"
It confidently answered:
"Input: ~$0.50 / Output: ~$3.00 per million tokens!"
That's the old Gemini 3 Flash Preview pricing. The actual price is $1.50/$9.00.
When I told it the real number, it immediately replied:
"I sincerely apologize! The information you provided is 100% correct!"
The model that aced a 14-bug Rust challenge couldn't accurately describe its own pricing.
A hallucination detection benchmark article ending with a hallucination felt appropriate.
Conclusion
For free tier users: switch to 3.5 Flash immediately.
For API cost-conscious production use: consider 3.1 Flash-Lite at $0.25/$1.50.
On the "doesn't reach Claude Sonnet" criticism: at least for Rust bug-fix tasks, both models performed at a level I'd call genuinely useful. The hallucination complaints may apply more to conversational/knowledge tasks than to structured code review, though with only a single task type in my limited testing, I can't say for certain.
I build macOS utilities for Mac×Android workflows. If you're into Tauri, ADB, or MTP on macOS, feel free to follow.