Open-Source Deep Research Is Closing the Gap, Slowly and Messily

Ohio State’s NLP group just released QUEST-35B, a Deep Research agent they trained on roughly 32 H100s with about 8,000 synthetic samples. They’ve open-sourced the whole thing: weights, training recipe, code, datasets. Benchmark results look competitive against some frontier systems. That’s a genuine achievement for an academic team.

The online reaction was predictably split. Some people were impressed. Others were immediately poking at the foundations: what exactly does it do, is the harness included, how brittle is it when you change the environment? Those are fair questions, not pedantry.

The “8k samples sounds tiny” reaction came up a lot. But someone doing a PhD in this area pointed out that these aren’t simple prompt-response pairs. They’re multi-step research traces involving search, verification and constraint satisfaction. Dense, in other words. The better analogy isn’t 8k flashcards; it’s 8k worked examples from someone who knows what they’re doing. That changes the calculation a bit.

Still, the same person was honest about the limits: don’t expect the model to generalise super well. Switch from a local retriever to a live web retriever, get blocked by a website, or change the language, and the model can get completely lost. That gap between a research project and a real product is where things get humbling. I’ve seen enough software projects that look great in the demo and fall apart on contact with the actual world to have some appreciation for that distinction.

QUEST-35B is a fine-tune of a Qwen base model, which prompted someone to note that Qwen is basically carrying consumer-level local models right now. That’s roughly accurate, and it’s a bit odd that a Chinese company’s open models are doing such heavy lifting for Western researchers and hobbyists. I don’t have a clean take on that. It’s worth noticing.

The broader point that keeps nagging at me is the gap between what frontier closed systems can do and what an academic team with 32 H100s can reproduce. The raw intelligence gap is apparently narrowing. The brittle-to-harness-changes problem, the live web access problem, the proprietary evaluation data problem: those are harder to close from outside. They require infrastructure and iteration time, not just clever training recipes.

What genuinely interests me here is the cost trajectory. Frontier systems are rumoured to be running on models north of a trillion parameters. QUEST-35B is 35 billion. If you can get comparable Deep Research performance from a model roughly thirty times smaller, that’s not just cheaper to run. It’s runnable locally, which changes who gets to use it and how. I find that worth caring about even if the current version is brittle.
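To put rough numbers on that, here is a back-of-the-envelope sketch. The one-trillion figure is the rumour taken at face value, the bit widths are illustrative quantisation levels I picked, and the memory estimate covers weights only (no KV cache, activations or runtime overhead), so treat the output as an order-of-magnitude guide rather than a spec.

```python
def size_ratio(large_params: float, small_params: float) -> float:
    """How many times larger the big model is, by parameter count."""
    return large_params / small_params


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


# Assumption: "north of a trillion" is taken as exactly 1e12 for illustration.
FRONTIER_PARAMS = 1e12
QUEST_PARAMS = 35e9

ratio = size_ratio(FRONTIER_PARAMS, QUEST_PARAMS)
print(f"size ratio: about {ratio:.1f}x")

# Illustrative precisions: 16-bit, 8-bit and 4-bit weights.
for bits in (16, 8, 4):
    small = weight_memory_gb(QUEST_PARAMS, bits)
    large = weight_memory_gb(FRONTIER_PARAMS, bits)
    print(f"{bits:>2}-bit weights: 35B ~ {small:.1f} GB, 1T ~ {large:.0f} GB")
```

The ratio comes out a little under thirty. At 4-bit weights the 35B model needs somewhere around 17.5 GB for its weights, which is the difference between a single well-specced consumer machine and a rack.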

The environmental side of this I can’t fully resolve in my own head. More open-source replication of expensive training runs means more compute being spent on work that’s already been done somewhere else. That’s waste, in a real sense. At the same time, open-source access to capable models reduces dependency on a handful of closed providers and probably reduces the total inference compute required once a good small model exists. I hold both of those thoughts and I’m not going to pretend one cancels out the other neatly.

The GGUF someone posted for local use apparently has a structural issue: block 40 has zero tensors despite being declared in the file. So if you grabbed that and ran it, you were running something broken. Small reminder that “open-sourced everything” and “production-ready” are not synonyms. Which is fine. It’s a research artefact, not a shipping product.
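A sanity check for that kind of problem is not complicated. The sketch below assumes tensor names follow the `blk.<index>.<rest>` pattern that llama.cpp-style GGUF files commonly use, and that you have already pulled the tensor names and the declared block count out of the file by whatever reader you trust. The function name and the sample tensor names are mine, purely for illustration; this is not how anyone involved actually found the bug.

```python
import re
from typing import Iterable

# Assumption: per-block tensors are named like "blk.12.attn_q.weight".
BLOCK_NAME = re.compile(r"^blk\.(\d+)\.")


def empty_blocks(tensor_names: Iterable[str], declared_blocks: int) -> list[int]:
    """Return indices of declared blocks that have no tensors at all."""
    counts = [0] * declared_blocks
    for name in tensor_names:
        match = BLOCK_NAME.match(name)
        if match is None:
            continue  # not a per-block tensor (e.g. embeddings, output head)
        index = int(match.group(1))
        if index < declared_blocks:
            counts[index] += 1
    return [i for i, count in enumerate(counts) if count == 0]


# Hypothetical file: three blocks declared, but nothing stored for block 2.
sample_names = [
    "token_embd.weight",
    "blk.0.attn_q.weight",
    "blk.0.ffn_up.weight",
    "blk.1.attn_q.weight",
    "blk.1.ffn_up.weight",
    "output.weight",
]
missing = empty_blocks(sample_names, declared_blocks=3)
print(missing)
```

An empty list means every declared block has at least one tensor. It says nothing about whether the tensors are the right ones, only that nothing is missing wholesale.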

Progress here is real. It’s also slower and messier than the hype suggests. Both things are true.