#6 — Tuning and Optimizing Workflows
November 15, 2023 • 2 min read

Breaking it down: Multi-turn workflows deliver massive wins
Founders take note: Complex AI tasks aren't solved with single prompts. The real magic happens when you break them down into strategic steps.
AlphaCodium's approach proves this: by restructuring code generation as a methodical workflow (problem analysis → test case reasoning → solution generation → solution ranking → synthetic testing → iterative refinement), the flow boosted GPT-4's accuracy (pass@5) on CodeContests from 19% to 44%.
Why this matters: Your engineering team can achieve similar breakthroughs by adopting structured approaches that interface cleanly with your existing systems.
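To make the shape of such a workflow concrete, here is a minimal Python sketch. It mirrors AlphaCodium's published stages, but the prompts, the `call_llm` client, and the `run_tests` harness are illustrative stand-ins, not their actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; wire up your API client here."""
    raise NotImplementedError

def run_tests(code: str, tests: list[str]) -> list[str]:
    """Stand-in for a sandboxed test runner; returns failing-test reports."""
    raise NotImplementedError

def solve(problem: str, public_tests: list[str]) -> str:
    # 1. Problem analysis: restate goals and constraints.
    analysis = call_llm(f"Analyze this problem; list goals and constraints:\n{problem}")

    # 2. Test case reasoning: spell out what each public test implies.
    test_notes = call_llm(f"Analysis:\n{analysis}\n\nExplain what each test checks:\n{public_tests}")

    # 3. Solution generation: draft several candidates.
    candidates = [
        call_llm(f"Write a solution.\nAnalysis:\n{analysis}\nTest notes:\n{test_notes}")
        for _ in range(3)
    ]

    # 4. Solution ranking: pick the most promising candidate.
    best = call_llm("Return the best of these candidates:\n" + "\n---\n".join(candidates))

    # 5. Synthetic testing: generate extra edge-case tests.
    extra_tests = call_llm(f"Write additional edge-case tests for:\n{problem}").splitlines()

    # 6. Iterative refinement: feed failures back until the tests pass (bounded).
    for _ in range(3):
        failures = run_tests(best, public_tests + extra_tests)
        if not failures:
            break
        best = call_llm(f"Fix this solution:\n{best}\nFailing tests:\n{failures}")
    return best
```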
THE RELIABILITY IMPERATIVE
Deterministic > Non-deterministic (for now)
Smart founders are prioritizing deterministic workflows. Each non-deterministic agent action introduces failure risk, and that risk compounds across steps: if every step is 95% reliable, a ten-step chain succeeds only about 60% of the time (0.95^10 ≈ 0.60).
Instead, generate and execute plans deterministically. This approach:
- Creates reusable few-shot examples
- Simplifies testing and debugging
- Makes failure analysis straightforward
- Produces DAGs that are easier to comprehend
Pro tip: Treat AI agents like junior engineers: give them clear objectives and concrete execution plans, as in the sketch below.
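Here is a minimal sketch of that pattern: the plan is a plain list of step names (a degenerate DAG), and execution is ordinary Python, so the same plan on the same inputs yields the same trace. The step names and registry are illustrative, not any particular framework's API.

```python
def fetch_context(state: dict) -> dict:
    state["context"] = "retrieved documents..."  # e.g., a retrieval call
    return state

def draft_answer(state: dict) -> dict:
    state["draft"] = f"answer grounded in {state['context']}"  # e.g., an LLM call
    return state

def check_citations(state: dict) -> dict:
    state["verified"] = True  # e.g., a deterministic validator
    return state

STEPS = {f.__name__: f for f in (fetch_context, draft_answer, check_citations)}
PLAN = ["fetch_context", "draft_answer", "check_citations"]

def execute(plan: list[str], state: dict) -> dict:
    # Deterministic execution: same plan + same inputs -> same trace, which is
    # what makes testing, debugging, and failure analysis straightforward.
    for step_name in plan:
        state = STEPS[step_name](state)
    return state

print(execute(PLAN, {"question": "..."}))
```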
BEYOND TEMPERATURE: STRATEGIC DIVERSITY
Temperature adjustments alone won't deliver the output variety your product needs. Savvy founders are implementing:
- Strategic prompt element shuffling
- Output tracking to prevent redundancy
- Prompt phrasing variations
For example, a recommendation engine can shuffle historical user data and vary prompt construction to dramatically increase suggestion diversity.
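Here is a minimal sketch of those levers applied to the recommendation example; the prompt templates and `call_llm` client are illustrative assumptions.

```python
import random
from collections import deque

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # wire up your model client

# Hypothetical templates: varying the phrasing is itself a diversity lever.
TEMPLATES = [
    "Given these past purchases, suggest one product:\n{history}",
    "A customer bought the following. Recommend something new:\n{history}",
]

recent_outputs: deque = deque(maxlen=50)  # track what we've already shown

def recommend(user_history: list[str]) -> str:
    suggestion = ""
    for _ in range(5):  # retry a few times if we hit a recent repeat
        history = user_history[:]
        random.shuffle(history)                   # shuffle prompt elements
        prompt = random.choice(TEMPLATES).format(history="\n".join(history))
        suggestion = call_llm(prompt)
        if suggestion not in recent_outputs:      # filter out redundancy
            break
    recent_outputs.append(suggestion)
    return suggestion
```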
THE CACHING ADVANTAGE
Underutilized opportunity alert: Caching delivers multiple competitive advantages:
- Immediate cost reduction
- Zero generation latency on cache hits
- Risk mitigation through pre-vetted responses
Implementation strategy (see the sketch after this list):
- Utilize unique IDs for processed items
- Normalize user inputs with autocomplete and spelling correction to maximize cache hits
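A minimal sketch of that strategy, with `normalize()` standing in for whatever canonicalization you apply (lowercasing here; autocomplete and spelling correction in practice) so near-duplicate queries share one cache key.

```python
import hashlib

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # wire up your model client

cache: dict[str, str] = {}

def normalize(query: str) -> str:
    # Collapse whitespace and case; add spell-correction in practice.
    return " ".join(query.lower().split())

def cached_generate(query: str) -> str:
    key = hashlib.sha256(normalize(query).encode()).hexdigest()
    if key in cache:
        return cache[key]        # cache hit: no generation cost or latency
    response = call_llm(query)
    cache[key] = response        # optionally store only human-vetted responses
    return response
```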
WHEN TO MAKE THE FINETUNING LEAP
Even brilliantly engineered prompts sometimes fall short. Successful founders like those behind Honeycomb's NLQ Assistant and Rechat's Lucy made strategic decisions to finetune when standard prompting couldn't deliver reliable, high-quality outputs.
Cost considerations: Finetuning requires significant investment in data annotation, model training, evaluation, and potentially self-hosting. Mitigate this by generating synthetic training data or bootstrapping with open-source datasets.
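One hedged sketch of the synthetic-data route: expand a few seed requests into paraphrase/target pairs using the model itself. The seeds, prompts, and `call_llm` client are illustrative, and in practice you would hand-review a sample before training on it.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # wire up your model client

SEEDS = ["show errors in the checkout service over the last hour"]  # hypothetical

def make_training_pairs(seeds: list[str], variants_per_seed: int = 5) -> list[dict]:
    pairs = []
    for seed in seeds:
        for _ in range(variants_per_seed):
            user_input = call_llm(f"Paraphrase this user request:\n{seed}")
            target = call_llm(f"Write the ideal structured output for:\n{user_input}")
            pairs.append({"input": user_input, "output": target})
    return pairs

def dump_jsonl(pairs: list[dict], path: str = "train.jsonl") -> None:
    # JSONL is a common input format for finetuning pipelines.
    with open(path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")
```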
The founders who master these implementation strategies will build more capable, reliable AI products while controlling costs—creating sustainable competitive advantage in today's AI-driven landscape.