5 Expensive Mistakes I Made Building with AI (And How I Fixed Them)

Julie Beynon
12 December 2024

Let's talk about learning things the hard way. You know those moments when you're staring at a computer screen thinking, "Well, that wasn't supposed to happen"? While implementing AI-powered data enrichment for our use cases like account fit scoring and upsell analysis, I made some... let's call them "educational mistakes." Here's what I learned, so you don't have to learn it the expensive way.

1. The "Bigger is Better" Trap: A Tale of LLM Selection

Picture this: Me, bright-eyed and bushy-tailed, throwing our most powerful (and expensive) GPT model at EVERY. SINGLE. TASK. Because if it costs more, it must be better, right? (nervous laughter)

Spoiler alert: That wasn't my finest moment.

After watching our costs climb faster than my coffee intake, I discovered something interesting: our lighter-weight model (gpt-4o-mini) could handle most tasks just as well as its bigger, pricier sibling. Want some numbers? I recently processed 180,000 accounts for fit scores and 14,000 for engagement scores - total cost: $120. Our daily processing now runs about $5 in production. That's a fraction of what traditional ICP fit score tools would charge!

I now save the heavyweight gpt-4o for special occasions, like quarterly ICP analysis or drafting those extra-important customer communications. You know, the ones where you actually need the AI equivalent of Shakespeare.
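
If you're curious what that routing looks like in practice, here's a minimal sketch. The task names and the pick_model helper are illustrative, not our actual pipeline; the only real assumption is the standard OpenAI Python client.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical task tiers: the cheap model is the default, and the
# expensive one is reserved for the rare, high-stakes jobs.
HEAVYWEIGHT_TASKS = {"quarterly_icp_analysis", "exec_customer_email"}

def pick_model(task: str) -> str:
    return "gpt-4o" if task in HEAVYWEIGHT_TASKS else "gpt-4o-mini"

def run_task(task: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=pick_model(task),
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # scoring should be repeatable, not creative
    )
    return response.choices[0].message.content
```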

2. From Testing Everything to Testing Smart

Here's where I made my biggest breakthrough: being selective about both testing and processing. Initially, I was like a kid in a candy store, enriching ALL THE THINGS! Now? I'm more like a careful shopper, and it starts way before production.

Let's talk testing strategy (because discovering token costs only after running your entire database through the model is painful). I grab about 100 companies I know inside and out - from Fortune 500s to small startups. When Microsoft suddenly scores lower than a local food truck, I know we've got a problem. This targeted testing lets me:

  • Iterate on prompts quickly without breaking the bank
  • Catch obvious scoring issues before they hit production
  • Extrapolate total costs before going big ("Oh, that would cost HOW much?") - there's a rough sketch of that math below
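
The extrapolation itself is back-of-envelope arithmetic. Every number below is a placeholder - swap in the token counts your sample actually used and the current prices for whichever model you tested with.

```python
# Back-of-envelope cost extrapolation from a ~100-company test run.
SAMPLE_SIZE = 100
SAMPLE_INPUT_TOKENS = 150_000    # total prompt tokens across the sample
SAMPLE_OUTPUT_TOKENS = 20_000    # total completion tokens across the sample

# Per-million-token prices for the model you tested with (check the
# current pricing page; these are illustrative, not quoted rates).
PRICE_INPUT_PER_M = 0.15
PRICE_OUTPUT_PER_M = 0.60

sample_cost = (
    SAMPLE_INPUT_TOKENS / 1_000_000 * PRICE_INPUT_PER_M
    + SAMPLE_OUTPUT_TOKENS / 1_000_000 * PRICE_OUTPUT_PER_M
)

TOTAL_ACCOUNTS = 180_000
projected_cost = sample_cost / SAMPLE_SIZE * TOTAL_ACCOUNTS
print(f"~${sample_cost:.2f} for the sample, ~${projected_cost:,.2f} for everything")
```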

Once I've got the prompts dialed in and costs looking reasonable, I stay picky about what goes into production (a simplified version of that filter is sketched after this list):

  • Active accounts actually doing things (no point scoring zombie accounts)
  • Records with meaningful changes (20% employee growth? Yes. New favicon? No.)
  • New accounts that need their first scoring
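
In code, that filter doesn't need to be fancy. The field names here are made up for illustration - map them onto whatever your own schema calls activity, growth, and existing scores.

```python
def should_enrich(account: dict) -> bool:
    """Decide whether an account is worth sending to the model at all."""
    never_scored = account.get("fit_score") is None          # new account, needs a first score
    is_active = account.get("active_last_90_days", False)    # skip zombie accounts
    meaningful_change = account.get("employee_growth_pct", 0) >= 20  # real change, not cosmetic

    return never_scored or (is_active and meaningful_change)
```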

3. The Field That Almost Broke the Bank

Ever heard the phrase "death by a thousand cuts"? Well, I discovered its data enrichment equivalent: death by a thousand unnecessary re-processes. My near-financial-disaster came from including frequently changing fields like "last_enriched_date" in our prompts. Every time these fields updated (which was... constantly), our system would helpfully re-process the entire row.

Imagine leaving the tap running and coming back to find your water bill could fund a small island nation. Yeah, it was kind of like that.
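
The fix that worked for me, in generic terms, is to fingerprint only the fields that actually matter for enrichment and ignore the bookkeeping columns. This is a minimal sketch of that idea, not our exact setup; the field list is hypothetical.

```python
import hashlib
import json

# Only these fields should trigger a re-process. Columns like
# "last_enriched_date" are deliberately left out, because they change
# on every run and would otherwise re-enrich the row forever.
STABLE_FIELDS = ["name", "domain", "industry", "employee_count", "country"]

def enrichment_fingerprint(row: dict) -> str:
    relevant = {field: row.get(field) for field in STABLE_FIELDS}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

def needs_reprocessing(row: dict, previous_fingerprint: str | None) -> bool:
    # Re-enrich only when something in the stable fields actually changed.
    return enrichment_fingerprint(row) != previous_fingerprint
```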

4. Safety First (Because Lessons Were Learned)

After my adventure in "How to Speed-Run Your AI Budget," I implemented some guardrails that saved our accounting team from having a collective heart attack (there's a rough sketch of the logic after this list):

  • Daily budget alerts at 50% & 100% thresholds 
  • Hard automatic shutoffs at 100% of monthly budget caps 
  • Restricted access to expensive models by environment - only prod leads can access gpt-4o 
  • Set up Census sync alerts for any job processing more than 1,000 rows in a day (our normal is ~100-200)
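
Stripped down to its essentials, the alert-and-shutoff logic is just a daily check against the budget. The cap below is illustrative (not our real number), and where the spend figure comes from - a billing API, your own usage log - is up to you.

```python
MONTHLY_BUDGET_USD = 500.0          # illustrative cap, not our real number
ALERT_THRESHOLDS = (0.5, 1.0)       # alert at 50% and 100%

def send_alert(message: str) -> None:
    # Stand-in for Slack/email/pager; replace with whatever you use.
    print(f"[budget-alert] {message}")

def check_budget(month_to_date_spend: float) -> bool:
    """Run daily. Returns True if processing may continue, False to hard-stop."""
    usage = month_to_date_spend / MONTHLY_BUDGET_USD
    for threshold in ALERT_THRESHOLDS:
        if usage >= threshold:
            send_alert(f"AI spend at {usage:.0%} of monthly budget")
    return usage < 1.0  # hard shutoff once the cap is hit
```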

5. The "Keep It Simple" Monitoring System

Our solution for tracking issues? A good old-fashioned Google Sheet where our sales team logs "hmm, that's not right" moments. No fancy systems, no complicated processes - just straight feedback from the people using the data. Sometimes the best solutions are the ones that don't require a computer science degree to understand.

What's Next?

I've got my eyes on something promising that could drop our costs even further: prompt caching. The idea is to structure prompts so the long, unchanging instructions come first, letting OpenAI reuse that cached prefix across requests instead of re-processing it every time. Early tests look good - I'll share more once we've got real numbers to back it up.
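
In practice that mostly means being disciplined about prompt order: static rubric first, per-account data last. A minimal sketch of that structure, assuming the standard OpenAI Python client (check OpenAI's prompt-caching docs for the exact conditions under which the prefix gets reused):

```python
from openai import OpenAI

client = OpenAI()

# The long, unchanging instructions live at the front of the prompt so the
# prefix is identical across requests; only the account data at the end changes.
STATIC_INSTRUCTIONS = """You are scoring B2B accounts for ICP fit.
...the full rubric, examples, and output format live here and never change...
"""

def score(account_summary: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # cacheable prefix
            {"role": "user", "content": account_summary},        # changes per account
        ],
    )
    return response.choices[0].message.content
```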

Remember: The goal isn't to use AI everywhere - it's to use it smartly. And sometimes, being smart means learning from someone else's mistakes. Like mine. You're welcome! 😉