
Step 10 of 10

Validate response quality and harden the system

Review a sample of generated replies for tone, accuracy, and safety, then tighten your prompt constraints based on what you find.

Why this matters

An automation that sends bad emails at scale does more harm than sending no emails at all. Quality validation is not optional; it is what separates a production system from a demo. Even well-structured prompts can produce replies that are too generic, contain hallucinated facts, or miss the context of the specific inquiry type. Catching these patterns now, before real leads receive them, is the purpose of this step.

Build instructions

Generate a diverse test sample

  1. Submit 10 test form entries with different inquiry types and message variations. Use these scenarios:

     • 'General question' with a one-line message
     • 'Pricing' with a detailed budget description
     • 'Project inquiry' with a technical problem statement
     • 'Support' with a complaint
     • 'General question' with a message in a language other than English

  2. For each submission, capture the AI-generated reply from the Zap run history or your BCC inbox. Paste all 10 replies into a document or Google Sheet for side-by-side comparison.

  3. Do not cherry-pick only the best replies. Include any that look problematic.

Score each reply against a quality rubric

Rate each reply on a 1-5 scale for three dimensions. 5 = excellent, 3 = acceptable, 1 = unacceptable.

  1. Clarity (1-5): Is the reply immediately understandable without re-reading? Does it use plain language? Does it avoid jargon the lead may not know?

  2. Relevance (1-5): Does the reply specifically address the inquiry type? Does it reference anything the lead mentioned in their message, or is it a generic template that ignores the context?

  3. Actionability (1-5): Is there one clear next step? Does the lead know what will happen next and when, or does the reply just say 'we will be in touch' with no specifics?

  4. Calculate the average score for each dimension across your 10 replies. Any dimension averaging below 3.5 needs prompt revision.
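The averaging step above is easy to do in the Google Sheet itself, but if you keep scores in a script, a short sketch like this makes the 3.5 revision threshold mechanical. The scores below are illustrative placeholders, not real data.

```python
# Sketch: compute per-dimension averages for a 10-reply sample and flag
# any dimension that falls below the 3.5 revision threshold.
# The scores are illustrative placeholders, not real data.

scores = [
    # (clarity, relevance, actionability) for each test reply
    (4, 3, 4), (5, 4, 3), (3, 2, 4), (4, 4, 5), (4, 3, 3),
    (5, 4, 4), (3, 3, 2), (4, 4, 4), (4, 3, 3), (5, 4, 4),
]

dimensions = ["clarity", "relevance", "actionability"]
THRESHOLD = 3.5  # any dimension averaging below this needs prompt revision

for i, name in enumerate(dimensions):
    avg = sum(row[i] for row in scores) / len(scores)
    status = "OK" if avg >= THRESHOLD else "NEEDS PROMPT REVISION"
    print(f"{name}: {avg:.2f} ({status})")
```

With this sample data, relevance averages 3.4 and gets flagged, which mirrors the most common real-world failure: replies that ignore the lead's context.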

Identify failure patterns and fix the prompt

  1. Look for recurring problems. Common patterns and fixes:

     • All replies sound the same regardless of inquiry type. Fix: add inquiry-type-specific instructions.
     • Replies promise follow-up 'soon' without specifics. Fix: add 'Do not use vague time words like soon or shortly' to the FORBIDDEN list.
     • Replies are too long. Fix: lower Max Tokens and add a word-count requirement to the prompt.

  2. Rewrite only the specific part of the prompt that addresses the failure. Do not rewrite the entire prompt; you may break what was working.

  3. Re-run 5 new test submissions after each prompt change and re-score them. Only move to the next change once the scores improve.
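One of the failure patterns above, vague time promises, is easy to catch automatically before you hand-score. A minimal lint sketch, assuming a small starting word list you would extend with your own findings:

```python
import re

# Sketch: an automated check for one recurring failure pattern --
# vague time promises. Run it over new test replies before hand-scoring.
# The word list is an illustrative starting point, not exhaustive.

VAGUE_TIME_WORDS = re.compile(r"\b(soon|shortly|asap|in due course)\b", re.IGNORECASE)

def lint_reply(reply: str) -> list[str]:
    """Return the vague time phrases found in the reply, if any."""
    return VAGUE_TIME_WORDS.findall(reply)

replies = [
    "Thanks for reaching out! We will be in touch soon.",
    "Thanks! I will send a detailed quote by Thursday at 5 pm.",
]

for reply in replies:
    hits = lint_reply(reply)
    if hits:
        print(f"FAIL ({', '.join(hits)}): {reply}")
    else:
        print(f"PASS: {reply}")
```

A check like this does not replace the rubric; it just catches the one pattern you have already decided is unacceptable, so reviewers can spend their attention on clarity and relevance.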

Set your go-live threshold

  1. Define what score counts as 'ready'. For example: average Clarity ≥ 4, Relevance ≥ 4, Actionability ≥ 3.5 across a 10-reply sample.

  2. Do not set the bar at 5/5: LLM output is probabilistic and will never be perfect on every run. Set a bar that reflects what you would find acceptable if you had written the email yourself on a busy day.

  3. When your sample meets the threshold, save your prompt to the Config tab in the Google Sheet so you have a versioned record of what is in production.
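Encoding the threshold makes the go-live decision mechanical rather than a reply-by-reply judgment call. A sketch using the example values from this step; substitute your own definition of 'ready':

```python
# Sketch: the go-live decision as a single boolean check.
# Thresholds are the example values from this step, not fixed rules.

GO_LIVE = {"clarity": 4.0, "relevance": 4.0, "actionability": 3.5}

def ready_for_production(averages: dict[str, float]) -> bool:
    """True only if every dimension meets its go-live threshold."""
    return all(averages[dim] >= bar for dim, bar in GO_LIVE.items())

sample = {"clarity": 4.3, "relevance": 4.0, "actionability": 3.6}
print(ready_for_production(sample))  # True for this sample
```

Note that the check requires every dimension to pass; a 5.0 on clarity does not compensate for a 3.4 on relevance.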

Common mistakes

  • Reviewing only one or two replies and concluding the system is ready. A small sample misses edge cases. Review at least 10 before making a quality judgment.
  • Trying to fix quality issues by changing the AI model (e.g., switching from GPT-3.5 to GPT-4) before improving the prompt. Model quality differences are real but small compared to prompt quality differences. Fix the prompt first.
  • Setting a quality threshold of 5/5 that the system will never reliably meet. Aim for consistent acceptable output, not occasional perfect output.

Pro tips

  • Keep a log of every prompt version and the average quality score it produced. This history becomes invaluable when the team wants to understand why a prompt changed.
  • Ask a colleague who does not know this is AI-generated to read 3 of your best replies. If they cannot tell they were AI-written, your system is ready.
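The version log from the first tip can be as simple as an append-only CSV next to your Config tab. A minimal sketch; the file path and column layout are illustrative choices, not a standard:

```python
import csv
from datetime import date

# Sketch: append each prompt revision and its sample averages to a CSV log,
# so the team can see why a prompt changed. Columns:
# date, version, clarity, relevance, actionability, note

def log_prompt_version(path: str, version: str,
                       avg_scores: dict[str, float], note: str) -> None:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            date.today().isoformat(), version,
            avg_scores["clarity"], avg_scores["relevance"],
            avg_scores["actionability"], note,
        ])

log_prompt_version(
    "prompt_log.csv", "v3",
    {"clarity": 4.3, "relevance": 4.0, "actionability": 3.6},
    "Added inquiry-type-specific instructions",
)
```

The same log also tells you when a change made things worse, which is just as useful as knowing when one helped.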

Before you continue

You have scored at least 10 replies against your rubric. All three dimensions average at or above your defined threshold. The prompt is saved in the Config tab. You would be comfortable putting your name on at least 8 of the 10 replies.

Step result

You have a quality baseline, a repeatable evaluation method, and a production-ready prompt that you have verified against real lead scenarios.