In my experience, AI evaluation systems often fail because teams give up on them, choosing instead to make decisions about models and prompts based on "gut feelings" and "vibes". The common, if flawed, conclusion is that these systems are useless. This leads to evaluation being deprioritised or even abandoned because it feels slow, complex, and disconnected. But I believe the actual issue isn't that the system lacks value; it's that the system isn't capturing the necessary "vibes". The solution isn't to throw the whole system away, but to improve it.
Why is Evaluation Ignored?
The Engineer's Perspective. Maybe you're like me; you train models in some sense of the word, and you probably agree that evaluation is very important. However, I've had moments where I introduce a new metric or dashboard and then completely abandon it, forgetting it even exists. Weeks later, a problem pops up, and those very metrics could've helped. On the flip side, I've also seen a "boring" metric, after the fifth look in a week, reveal something genuinely fascinating. Setting up evaluation tends to be the fun part, but looking at it every day is a hassle. Why not just wait for things to break in a more visible way?
The Leadership Perspective. Maybe you need the AI to "work," and it’s not. You probably feel like there's no time to build an evaluation system when the core issue is that the AI fails to perform as expected. You see all these evaluation systems out there, but you're not getting any real value, and everything still feels broken. Plus, from the outside, it looks like the AI engineers are just over-engineering, doing stuff they enjoy, instead of focusing on getting it functional. "Let's just test it manually," you might say, trying to help.
And you know what? These feelings aren't wrong. Sometimes, evaluation is a time sink. It is often not implemented correctly. But even when it is, it can take a while before you see the payoff.
Where Evaluation Can Go Wrong
1. Siloed Evaluation
Traditional machine learning often left evaluation solely to engineers, relying on well-established metrics. But in today's AI landscape, particularly with large language models, metrics only make sense when they capture the product's "vibes". I think product folks should be the ones writing your LLM-as-a-judge [1] prompts and deciding whether the AI is good enough, because engineers typically have less product context than a product manager. Supervised ML sometimes let us get away without that involvement, since there was usually a single ground-truth answer; generative models are inconsistent, and there isn't just one "right" answer. This means non-engineers (or even users) need to be involved in evaluating the product by proxy.
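To make that split concrete, here is a minimal sketch: the rubric text is the part the product team would own, and engineering just wires it to a model call. It assumes the OpenAI Python client and an illustrative model name; adapt it to whatever stack you actually use.

```python
# Minimal LLM-as-a-judge sketch: the rubric is plain English owned by the
# product team; engineering only wires it to a model call.
# Assumes the OpenAI Python client (>=1.0) and an illustrative model name.
from openai import OpenAI

client = OpenAI()

# Written and maintained by the product team, not by engineers.
JUDGE_RUBRIC = """You are reviewing a reply from our support assistant.
Mark it PASS only if it:
- answers the customer's actual question,
- matches our friendly-but-concise tone,
- never promises refunds without pointing to the refund policy.
Reply with PASS or FAIL and one sentence explaining why."""

def judge(user_message: str, assistant_reply: str) -> str:
    """Ask the judge model to grade one reply against the product rubric."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Customer message:\n{user_message}\n\n"
        f"Assistant reply:\n{assistant_reply}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; swap for whatever you use
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example:
# print(judge("Where is my order?", "It shipped yesterday; here is the tracking link ..."))
```

The point of the structure is that changing the rubric never requires touching the wiring, so product can keep iterating on it without an engineering ticket.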
2. Choosing the wrong metrics
One of the most common issues is simply using the wrong metrics for the job. A frequent example is trying to mark an LLM output as merely "good" or "bad" when the actual project goal is to improve nuanced responses. Another pitfall is borrowing standard academic metrics, such as BLEU, when what the company truly cares about is user satisfaction or specific business outcomes. The right metric should directly align with your business or user goal, not just an academic benchmark.
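As an illustration of the difference, here is a sketch of a product-aligned rubric score instead of a single good/bad flag. The criteria and weights are invented for the example; yours should come straight from the product goal.

```python
# Sketch of a product-aligned metric instead of a single good/bad flag.
# The criteria and weights are made up for illustration; they should come
# from the actual product goal (e.g. "nuanced, on-brand answers").
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # how much the business cares about this dimension

CRITERIA = [
    Criterion("answers_the_question", weight=0.5),
    Criterion("appropriate_nuance", weight=0.3),  # hedges where it should
    Criterion("on_brand_tone", weight=0.2),
]

def aggregate(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each 1-5) into one weighted number."""
    return sum(c.weight * scores[c.name] for c in CRITERIA)

# A reviewer (human or LLM judge) fills in the per-criterion scores:
example_scores = {"answers_the_question": 5, "appropriate_nuance": 3, "on_brand_tone": 4}
print(aggregate(example_scores))  # about 4.2 -- tracks the goal in a way BLEU would not
```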
3. Giving Up Too Soon
I've touched on this already, but if we drop the evaluation framework every time something doesn't work, we'll never get anywhere. People often give up too fast. A good evaluation system can represent 90% of the work, yet it's easy to deprioritise it in favour of other demands, especially in startups where new features can feel like the absolute top priority. The result is an evaluation system that's not used, which leads to the conclusion that it's useless. It's a self-fulfilling prophecy.
4. Lack of Clear Iteration Paths
Even a seemingly perfect evaluation system can leave us stumped about what to do next. Users might not like an answer, but how do we make it better? Does it need more prompt engineering or a better underlying model [2]? This is the next hard problem to solve once a decent setup is in place. A well-designed evaluation setup should, in fact, make this problem much easier to address. I'll cover this in a future post, but feel free to reach out if you'd like my advice on your specific problem.
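To show what I mean by "easier to address", here is a rough sketch of how even a small eval set turns "prompt or model?" into a quick experiment. `generate` and `judge_score` are hypothetical stand-ins for your own generation call and your judge (LLM or human); the candidate configurations are made up.

```python
# Sketch of how an eval set turns "prompt or model?" into an experiment.
from statistics import mean

def generate(prompt_template: str, model: str, user_input: str) -> str:
    return "<model output>"  # stand-in: call your actual model here

def judge_score(user_input: str, output: str) -> float:
    return 3.0  # stand-in: return the judge's 1-5 score here

EVAL_SET = ["Where is my order?", "Can I get a refund?", "The app keeps crashing."]

CANDIDATES = {
    "current":       {"prompt_template": "v1", "model": "small-model"},
    "better_prompt": {"prompt_template": "v2", "model": "small-model"},
    "bigger_model":  {"prompt_template": "v1", "model": "big-model"},
}

for name, cfg in CANDIDATES.items():
    scores = [
        judge_score(x, generate(cfg["prompt_template"], cfg["model"], x))
        for x in EVAL_SET
    ]
    print(f"{name}: mean score {mean(scores):.2f} over {len(EVAL_SET)} examples")
```

The same harness answers both questions: if the better prompt closes the gap, you didn't need the bigger model.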
5. Overlooking the Complexity of Evaluation Data
For good AI evaluation, you need the right data. And it's not just about getting people to label stuff; sometimes, building the dataset to label is itself a huge pain. You have to define your goal clearly beforehand so you know where to start. Then you need to figure out how to get the data (do you have everyone on the team write examples down, or do you hire freelancers?). And at evaluation time, how do you sample it so it's not biased towards what Engineer A or Product Manager B happens to care about? This complexity often pushes teams back to manual, one-off tests, which do not scale.
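One way to keep the sample honest is to stratify it over whatever grouping your logs already have (feature, intent, customer tier, and so on). Here is a rough sketch; the field names are assumptions.

```python
# Sketch of stratified sampling for an evaluation set, so no single
# stakeholder's favourite scenario dominates. Assumes your logged
# interactions carry some grouping label (feature, intent, customer tier...).
import random
from collections import defaultdict

def stratified_sample(records: list[dict], by: str, per_group: int, seed: int = 0) -> list[dict]:
    """Take up to `per_group` random records from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[r[by]].append(r)
    sample = []
    for group_records in groups.values():
        k = min(per_group, len(group_records))
        sample.extend(rng.sample(group_records, k))
    return sample

logs = [
    {"feature": "billing", "text": "Why was I charged twice?"},
    {"feature": "billing", "text": "Update my card details."},
    {"feature": "search",  "text": "Can't find last month's invoice."},
]
print(stratified_sample(logs, by="feature", per_group=1))
```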
Don't Just Evaluate, Iterate!

Teams often build an evaluation framework once and then only iterate on the core system, like the model or the prompt. Unlike those core components, evaluation doesn't sit at the centre of the traditional ML diagram [3]; it's treated as just another building block you set up once and forget. The truth is, you should continually iterate on the evaluation framework itself. Luckily, this also means you can build it much sooner than you might expect. You don't need a full-fledged, fancy LangSmith dashboard if your product isn't there yet. Sometimes, a well-crafted set of a few examples from the product team for a particular piece of functionality can do wonders to make sure your engineers are vibe-evaluating the same way. It brings everyone onto the same page, even if that page is still a work in progress.
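Here is roughly what that lightweight, pre-dashboard version could look like: a tiny example set owned by the product team and a script that puts the model's answers next to their notes. The examples and the `generate` stand-in are hypothetical.

```python
# Sketch of the "few well-crafted examples" idea: a tiny, product-team-owned
# list of inputs plus notes on what good looks like, and a script that dumps
# the model's answers next to those notes so everyone vibe-checks the same way.
import csv

# In reality this list is written by the product team, not invented by engineers.
EXAMPLES = [
    {"input": "I was charged twice this month.",
     "what_good_looks_like": "Apologises, explains how duplicate charges happen, links refund steps."},
    {"input": "How do I export my data?",
     "what_good_looks_like": "Short numbered steps, mentions the export format."},
]

def generate(user_input: str) -> str:
    return "<model output>"  # stand-in for your actual model call

with open("vibe_check.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "what_good_looks_like", "model_output"])
    writer.writeheader()
    for ex in EXAMPLES:
        writer.writerow({**ex, "model_output": generate(ex["input"])})
# Now the whole team reviews the same vibe_check.csv instead of five private chat windows.
```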
Evaluation doesn't have to be perfect; it can just be iterative. Constantly changing the metrics you track or evolving your LLM-as-a-Judge prompts are not signs of failure; they are signs of a successful AI evaluation system. A system that is built once and never updated will never be successful. Just as models and the data they are trained on are imperfect and constantly evolving, your approach to evaluating them will also be imperfect and continually evolving.
Good evaluation systems are:
Designed with direct product feedback in mind. They start from what the user and business care about, and they evolve as requirements change.
Collaborative, bringing engineers, product managers, and even users together. Feedback from different stakeholders guides the changes in the evaluation system.
Logged, visualised, and revisited often, not just living in someone's head (a minimal logging sketch follows below).
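For the "logged and revisited" part, a single append-only file already goes a long way. Here is a minimal sketch; the file name and fields are illustrative.

```python
# Minimal sketch of logging eval runs so they can be revisited and plotted
# later. Appends one JSON line per run; file name and fields are illustrative.
import json
from datetime import datetime, timezone
from pathlib import Path

def log_eval_run(system_version: str, metrics: dict[str, float],
                 path: str = "eval_runs.jsonl") -> None:
    """Append one evaluation run to a JSONL file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system_version": system_version,  # e.g. prompt v3 + model X
        **metrics,
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(record) + "\n")

log_eval_run("prompt-v3/gpt-4o-mini", {"mean_judge_score": 4.1, "pass_rate": 0.82})
```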
Ultimately, evaluation is what transforms vague user feedback into concrete system improvement. Without it, everything you do is based purely on intuition or guesswork. But with a solid, evolving evaluation system in place, you gain:
Measurable iteration: You know if your changes are improving things.
Early warning signals: You catch problems before they blow up in production.
Confidence in what you ship: You have data to back up your decisions.
It takes time to see the full gains, but embracing this iterative approach to evaluation is how you truly build better AI products. Don’t give up because your first attempt didn’t work.
And if you need some help with it, feel free to reach out.
[1] Here is a guide if you don’t know what that is: https://www.evidentlyai.com/llm-guide/llm-as-a-judge
[3] Here is a cool picture from https://ml-ops.org/content/end-to-end-ml-workflow with a lot going on