The instinct to swap models
From my experience, when something isn't working in an AI product, the first thing people want to do is try a different model. I’ve heard it in job interviews and from friends discussing their next startup idea: “What model would you use?” It makes sense: the model is the thing doing the work, right? And if the output isn’t good, maybe you just need a better one. But in my experience, this instinct has been mostly wrong.
Understandably, model choice feels like the core decision. I’m not saying models don’t matter at all, but when things don’t work, they are rarely the root cause. Most of the time, switching models doesn’t get you anywhere. Or maybe it does feel better, but you can’t quite put it into words (vibe testing, anyone?). That is usually when I get asked, “What model would you use?”
Where does this instinct come from?
In academic research, where the goal is to push the edge of knowledge, new models have always been very important. An existing dataset and its corresponding metrics are usually chosen at the start of a research project, and the researchers then try to improve the scores on these so-called benchmark datasets. I did that a couple of times, too. I was more interested in what my model did on ImageNet than in how well it worked for problems out in the real world. There is a strong incentive to publish a paper with the next SOTA (state-of-the-art) model, and so I, like most people I knew, focused on that.
That’s what research is optimized for. OpenAI has had a strong research focus since its inception, and it shows. Besides, they are in the business of selling models, so they need to tell you why their model is better. They don’t market “spending a week curating a dataset and tuning a prompt.” And it’s not their problem if your prompt just doesn’t work and you don’t know where to go next. You are the one using their model and building things with it, after all.
What those companies actually do
The thing is, foundation models like GPTs weren’t trained for your product. They were optimized to do well on certain tasks, which include being very good at being ChatGPT. Of course they don’t work “out of the box” for everything; that was never the design goal.
Instead of using AI companies’ press statements to decide which model to use, borrow their strategy for model improvement. The lesson to take from their releases is not “GPT-4 is better than GPT-3.5”; it’s how they test, validate, and improve on their existing models.
Once you change your perspective, you will notice that their model improvements come from careful benchmarking, highly curated datasets, and evaluation-guided iteration. They say things like “On SWE-bench Verified, a measure of real-world software engineering skills, GPT‑4.1 completes 54.6% of tasks, compared to 33.2% for GPT‑4o (2024-11-20).”1 They also share real-world examples, and notice how similar the framing is even when the benchmark changes: “Their users noted that it was 30% more efficient in tool calling and about 50% less likely to repeat unnecessary edits or read code in overly narrow, incremental steps.” They even have whole teams dedicated to building new datasets for evaluating new behaviours.2
What you should do instead of changing the model
First, you need to understand your goal and craft datasets that let you test whether that goal is achieved. Just like OpenAI does. You need feedback in actual numbers, not in vibes. Just like OpenAI gets. They are telling you how they made their model better, and you should listen and use it to make your own AI features better. This is not new information. In my opinion, OpenAI is a feat of engineering more than it is a feat of research.3 There is no reason why you can’t build an evaluation system and an iteration loop adapted to your particular use case.
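To make that concrete, here is a minimal sketch of what such an evaluation system can look like. Everything in it is an assumption on my part: call_model stands in for whatever client you actually use, and the dataset and scoring rule are toy placeholders for the ones you would curate for your own product.

```python
# Minimal evaluation harness: a hand-curated dataset, a scoring rule,
# and a single number you can compare across changes.
# `call_model` is a placeholder for whatever API or client you actually use.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str    # what your feature sends to the model
    expected: str  # what a correct answer must contain


def exact_contains(output: str, expected: str) -> bool:
    """Simplest possible check; replace with whatever 'good' means for your product."""
    return expected.lower() in output.lower()


def evaluate(call_model: Callable[[str], str], dataset: list[Example]) -> float:
    """Run every example through the model and return the pass rate."""
    passed = sum(exact_contains(call_model(ex.prompt), ex.expected) for ex in dataset)
    return passed / len(dataset)


# A tiny, hand-curated set of cases your feature keeps getting wrong.
dataset = [
    Example(prompt="Summarise this refund policy: ...", expected="30 days"),
    Example(prompt="Extract the invoice total from: ...", expected="142.50"),
]
```

The scoring rule here is deliberately naive; the point is that every change you make, whether to the prompt, the retrieval step, or the model, now produces a number you can compare.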
The full how may not be obvious at this point, but I hope you agree that it sounds like a safe path. If you are thinking of changing the model, start with evaluation: ask what’s broken, how to measure it, and how to fix it systematically.
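Continuing the hypothetical sketch above, a model swap then stops being a leap of faith and becomes just another candidate scored against the same dataset. The names call_gpt4o, call_gpt41, PROMPT_V1, and PROMPT_V2 are all made up for illustration:

```python
# Score every candidate change against the same curated dataset,
# then decide based on numbers rather than vibes.
# `call_gpt4o`, `call_gpt41`, `PROMPT_V1`, `PROMPT_V2` are hypothetical names.
candidates = {
    "current prompt + current model": lambda p: call_gpt4o(PROMPT_V1.format(task=p)),
    "current prompt + new model":     lambda p: call_gpt41(PROMPT_V1.format(task=p)),
    "revised prompt + current model": lambda p: call_gpt4o(PROMPT_V2.format(task=p)),
}

for name, candidate in candidates.items():
    print(f"{name}: {evaluate(candidate, dataset):.0%} pass rate")
```

Sometimes the new model wins. Often the revised prompt, or a better-curated dataset, wins by more, and now you have the numbers to know the difference.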
I hope to expand on that here, but for now, here are some nice resources I found:
And next time you read an announcement like “Introducing GPT-4.1 in the API”, I hope you get inspired to improve your own AI evaluation system.
Then let me know if changing the model turned out to be the right decision.
https://openai.com/index/gpt-4-1/
https://huggingface.co/datasets/openai/graphwalks
https://openai.com/index/ai-and-compute/