How do you know if any of the complicated stuff you're doing is meaningfully impacting the content you're generating? You don't. And neither do I.

But wouldn't it be interesting if we found out?

You care a lot about your enterprise AI workflows. Hell yeah, and same. Are they lit tho?

Nobody's testing any of this

Which documents should the model see? What instructions actually make the output better? Does that fancy integration you spent three weeks on do anything? There's no good way to find out. So people just pick a setup, ship it, and cross their fingers.

You can't read all of it

AI can generate an enormous amount of stuff very quickly. You cannot read an enormous amount of stuff very quickly. The hard part was never getting it to write things — it's figuring out if what it wrote is any good.

Your messy knowledge base is perfect and it's never done anything wrong in its life

Everyone thinks they need to clean up all their internal knowledge before AI can use it. But slow down there cowboy, sometimes the horse that gallops fastest is the first one to the cactus. You won't find the most impactful information in your business by reorganizing your Google Drive. You need to experiment at scale to figure out where your proprietary knowledge actually moves the needle.

FAFO at enterprise scale.

Data driven culture requires an experiment driven culture

Go beyond prompt swapping and start pulling all the levers you have available — what context the model gets, what instructions you give it, which of your internal docs it's pulling from. Run a bunch of combinations side by side and understand the impact of your inputs on what gets generated.
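A toy sketch of what that lever-pulling could look like. Everything here is illustrative: the lever names, the values, and the `run_condition` stub (you'd swap in your actual model call) are all made up for the example.

```python
from itertools import product

# Hypothetical levers -- names and values are illustrative, not a real API.
system_prompts = ["You are a concise analyst.", "You are a thorough analyst."]
doc_sources = ["no_docs", "sales_playbook", "support_wiki"]
temperatures = [0.2, 0.7]

def run_condition(prompt: str, docs: str, temperature: float) -> str:
    """Stub for a model call; replace with your provider's client."""
    return f"output for ({prompt!r}, {docs!r}, {temperature})"

# Every combination of levers, run side by side.
results = {
    (p, d, t): run_condition(p, d, t)
    for p, d, t in product(system_prompts, doc_sources, temperatures)
}

print(len(results))  # 2 * 3 * 2 = 12 conditions
```

The point isn't the loop, it's the discipline: each output is tagged with exactly which levers produced it, so you can actually attribute quality differences later.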

"Good" means different things to different people and that's actually so beautiful

Your compliance team and your marketing team will look at the same output and care about completely different things. That's fine — that's the point. Let them both define what they're looking for. The interesting part is where they disagree.
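One way to make "different definitions of good" concrete: let each team score the same output with its own rubric, then flag where they split. The rubrics below are deliberately crude, hypothetical examples, not real evaluation criteria.

```python
# One output, two hypothetical rubrics -- criteria are illustrative only.
output = "Our product cuts onboarding time by 40%."

def compliance_score(text: str) -> float:
    # Compliance wants hedged claims with a disclaimer.
    return 1.0 if "may vary" in text else 0.0

def marketing_score(text: str) -> float:
    # Marketing wants a concrete, punchy number.
    return 1.0 if "%" in text else 0.0

scores = {
    "compliance": compliance_score(output),
    "marketing": marketing_score(output),
}

# Disagreement between teams is the signal worth digging into.
teams_disagree = scores["compliance"] != scores["marketing"]
print(scores, teams_disagree)
```

Here marketing loves the line and compliance hates it, which is exactly the kind of tension you want surfaced by the experiment instead of discovered in a launch review.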

Test against a control, kind of like a scientist would

Every experiment includes a control: the model with nothing special added. Does your expensive RAG setup actually beat the model on its own? My sister (who has a PhD, btw) was shocked to find out that we're not really doing much to answer these kinds of questions in the business world, and she's not wrong to feel that way.
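The control comparison itself is dead simple once you have scores. A minimal sketch, assuming you've already rated each output against your own quality rubric; the condition names and numbers below are invented for illustration.

```python
import statistics

# Hypothetical rubric scores (0-1) per condition -- illustrative data only.
scores = {
    "control (bare model)": [0.61, 0.58, 0.64, 0.60],
    "rag (internal docs)":  [0.72, 0.69, 0.75, 0.70],
}

control_mean = statistics.mean(scores["control (bare model)"])

# Lift over control: positive means the fancy setup is earning its keep.
for condition, vals in scores.items():
    lift = statistics.mean(vals) - control_mean
    print(f"{condition}: mean={statistics.mean(vals):.3f} lift={lift:+.3f}")
```

If the lift is zero (or negative), the control just saved you from shipping complexity that does nothing.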

We're not tied to any model provider and we're not selling the AI tooling — we're measuring whether the tooling works. The models and the infrastructure are all going to be interchangeable eventually. What won't be interchangeable is your definition of quality and the data that shows which approaches actually meet it. Every experiment makes the next one smarter.

Sup pup?