AI Tutor Prompt Optimization: Cost-Aware Strategies for Multiple-Choice Questions
Evaluating Baseline, Self-Refine, and GEPA approaches for AI tutors under strict format requirements and cost-aware metrics.
Key Findings at a Glance
Tokens Per Correct
Optimizing for both accuracy and efficiency by measuring total tokens used per correct answer.
Format Compliance
Strict enforcement of "Answer: <LETTER>" format with zero tolerance for violations.
Hybrid Approach
SR→GEPA auditor with confidence threshold ≥0.85 improves robustness while reducing tokens per correct.
The study evaluates three approaches to AI-tutor multiple-choice questions: a simple Baseline, two-call Self-Refine (SR), and GEPA's evolved prompts. The findings reveal that minimal reflective edits distilled into a single-call prompt can match SR's accuracy at lower cost, while a hybrid approach improves robustness on challenging datasets.
Understanding Prompt Strategies
Prompt Design Philosophy
Effective prompts for AI tutors follow specific patterns that guide the model's reasoning process:
  • Restate the question and quote relevant evidence
  • Ensure answer choices align with question stem (watch for NOT, EXCEPT, LEAST)
  • Clarify the question stem in simpler terms
  • Provide clear rationale before final answer
Dataset-specific prompts can include specialized instructions (like "Identify the flaw in the argument" for LSAT-LR), but these may confuse the model when applied to other domains like science or math.
Question Analysis
Restate and clarify the core question
Evidence Evaluation
Quote one relevant sentence as evidence
Elimination Process
Systematically eliminate wrong options
Final Answer
Provide "Answer: <LETTER>" format
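As a rough illustration of the four steps above, a single-call prompt could be rendered as follows. This is only a sketch: the template wording, the `PROMPT_TEMPLATE` constant, and the `render_prompt` helper are hypothetical and are not the study's actual prompt.

```python
# Hypothetical single-call prompt template. The wording is illustrative only,
# not the prompt used in the study.
PROMPT_TEMPLATE = """You are a tutor answering a multiple-choice question.
1. Restate the question in simpler terms (watch for NOT, EXCEPT, LEAST).
2. Quote one relevant sentence as evidence.
3. Eliminate the wrong options.
4. End with exactly one final line: Answer: <LETTER>
Allowed letters: {letters}

Question: {stem}
{options}
"""


def render_prompt(stem: str, choices: dict) -> str:
    """Fill the template with the stem, the lettered options, and the allowed letters."""
    letters = ", ".join(sorted(choices))
    options = "\n".join(f"{key}. {text}" for key, text in sorted(choices.items()))
    return PROMPT_TEMPLATE.format(letters=letters, stem=stem, options=options)


if __name__ == "__main__":
    print(render_prompt(
        "Which gas is NOT a noble gas?",
        {"A": "Neon", "B": "Argon", "C": "Oxygen", "D": "Krypton"},
    ))
```

Listing the allowed letters in the prompt keeps the output rule explicit for questions with different numbers of options.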
GEPA Variant Selection Process
GEPA (Genetic-Pareto), a reflective prompt-evolution method, creates multiple small edits to the baseline instructions, labeled Variants A, B, and C. Each variant is tested on a development split, and the winner is chosen by Pareto optimization: maximizing accuracy while minimizing token usage.
Create Variants
Generate small edits to base instructions (A/B/C)
Test on Dev Split
Evaluate each variant's performance
Pareto Selection
Choose best accuracy/token balance
Deploy Winner
Use winning prompt on test set
Similar variants may emerge across different datasets because they share the same starting point and are constrained by tight reflection parameters. The winning variant is selected based on both accuracy and efficiency metrics.
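The Pareto selection step can be sketched as below. The `VariantResult` type, the function names, and the tie-break rule (highest accuracy on the front, then fewest tokens) are assumptions made for illustration; the source does not specify how the final winner is picked from the Pareto front.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VariantResult:
    name: str          # e.g. "A", "B", "C"
    accuracy: float    # dev-split accuracy in [0, 1]
    avg_tokens: float  # average tokens per question on the dev split


def dominates(a: VariantResult, b: VariantResult) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.avg_tokens <= b.avg_tokens
    better = a.accuracy > b.accuracy or a.avg_tokens < b.avg_tokens
    return no_worse and better


def pareto_front(results: List[VariantResult]) -> List[VariantResult]:
    """Keep only the variants that no other variant dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


def pick_winner(results: List[VariantResult]) -> VariantResult:
    # Assumed tie-break: highest accuracy on the front, then fewest tokens.
    front = pareto_front(results)
    return max(front, key=lambda r: (r.accuracy, -r.avg_tokens))
```

With made-up dev results of A (0.70, 220 tokens), B (0.70, 180), and C (0.65, 150), A is dominated by B, the front is B and C, and the assumed tie-break selects B.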
Grading Methodology
Load Dataset
Each question dataset is loaded for processing
Render Prompt
Format prompt with allowed letters and strict output rule
Run Model
Execute Baseline, Self-Refine, or GEPA-chosen prompt
Parse Last Line
Extract only the final "Answer: <LETTER>" line
Compare to Key
Check against answer key for correctness
Calculate Metrics
Aggregate accuracy and token cost data
Format compliance is enforced in code with the regex (?m)^Answer:\s*([A-Z])\s*$, applied to the model's final answer line. Any deviation scores 0 and is recorded as a format violation. This fail-closed approach keeps evaluation consistent across all methods.
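A minimal sketch of the parse-and-grade steps, using the regex quoted above. Two details are assumptions: the "last line" is taken to be the last non-empty line of the output, and the `allowed` parameter (restricting letters to the question's options) is a hypothetical addition. The function names are illustrative.

```python
import re
from typing import Optional, Tuple

# Regex quoted in the text.
ANSWER_RE = re.compile(r"(?m)^Answer:\s*([A-Z])\s*$")


def parse_answer(output: str, allowed: str = "ABCDE") -> Optional[str]:
    """Return the letter on the last non-empty line, or None on any format violation."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    match = ANSWER_RE.fullmatch(lines[-1])
    if match is None or match.group(1) not in allowed:
        return None
    return match.group(1)


def grade(output: str, key: str, allowed: str = "ABCDE") -> Tuple[int, bool]:
    """Fail closed: return (score, format_violation)."""
    letter = parse_answer(output, allowed)
    if letter is None:
        return 0, True           # violation always scores 0
    return int(letter == key), False
```

Under these assumptions, an output that states the right letter but adds text after the answer line is scored 0 and counted as a format violation rather than being rescued by a looser parser.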
Methodological Approaches
Baseline Prompt
Single-call approach instructing the model to restate the question, cite evidence, eliminate wrong options, and output "Answer: <LETTER>".
Self-Refine (SR)
Two-call chain where the first call proposes an answer and the second critiques and revises it. Improves accuracy but roughly doubles token cost and latency.
GEPA
Collects failed examples, uses reflection to propose concise edits, creates variants A/B/C, evaluates on dev set, and selects the best performer using Pareto filtering.
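The Self-Refine chain described above might look like the sketch below. `call_model` is a hypothetical stand-in for whatever LLM client is used (it takes a prompt and returns the completion text plus the tokens consumed by that call), and the critique wording is illustrative.

```python
from typing import Callable, Tuple

# Hypothetical model interface: prompt -> (completion_text, tokens_used_for_call).
ModelFn = Callable[[str], Tuple[str, int]]


def self_refine(question_prompt: str, call_model: ModelFn) -> Tuple[str, int]:
    """Run propose -> critique/revise and return (final_text, total_tokens)."""
    # Call 1: propose an answer.
    draft, first_tokens = call_model(question_prompt)

    # Call 2: critique and revise. The prompt carries the first call's
    # transcript, so this call is charged for it as well.
    critique_prompt = (
        f"{question_prompt}\n\nDraft response:\n{draft}\n\n"
        "Critique the draft, fix any error, and end with one final line: "
        "Answer: <LETTER>"
    )
    final, second_tokens = call_model(critique_prompt)
    return final, first_tokens + second_tokens
```

Because the second prompt embeds the first transcript, the total token count in this sketch is at least twice that of the first call alone.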
Hybrid SR→GEPA Approach
The hybrid passes Self-Refine's output to a GEPA auditor under strict override rules:
  • GEPA may only overrule SR when confidence ≥ 0.85
  • The auditor must explicitly flag a specific error type
  • Error types include: misread stem, polarity flip, eliminated correct option, format violation
Confidence is measured using a rubric score s ∈ [0,1] calibrated on the dev set. Overrides require both high confidence and a one-sentence rationale tied to the error label.
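The override rules above reduce to a small gate. In this sketch the `AuditVerdict` structure and the snake_case error labels are assumptions; only the 0.85 threshold, the four error types, and the rationale requirement come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Error labels mirror the list above; the identifier spellings are assumed.
ERROR_TYPES = {
    "misread_stem",
    "polarity_flip",
    "eliminated_correct_option",
    "format_violation",
}
THRESHOLD = 0.85


@dataclass
class AuditVerdict:
    proposed_letter: str
    confidence: float          # rubric score s in [0, 1]
    error_type: Optional[str]  # one of ERROR_TYPES, or None if nothing is flagged
    rationale: str             # one-sentence justification tied to error_type


def final_answer(sr_letter: str, verdict: AuditVerdict) -> str:
    """Keep SR's answer unless every override condition holds."""
    may_override = (
        verdict.confidence >= THRESHOLD
        and verdict.error_type in ERROR_TYPES
        and bool(verdict.rationale.strip())
    )
    return verdict.proposed_letter if may_override else sr_letter
```

Defaulting to SR's answer means a low-confidence or unlabeled objection from the auditor can never change the result.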
Evaluation Metrics
Accuracy
Percentage of correctly answered questions on development and test sets
Tokens
Average tokens per question across all model calls
Cost
Tokens per correct answer and USD cost per 100 correct answers
Format
Percentage of answers following strict "Answer: <LETTER>" format
Cost accounting includes both prompt and completion tokens for every call. For Self-Refine, the second call includes the first call's transcript where applicable, accurately reflecting real-world usage costs.
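The two cost metrics can be computed as in this sketch. The function names and the per-1,000-token price parameters are placeholders, not figures from the study; returning `None` when nothing is correct is also an assumed convention.

```python
from typing import Optional


def tokens_per_correct(total_tokens: int, n_correct: int) -> Optional[float]:
    """Prompt + completion tokens over all calls, divided by correct answers."""
    if n_correct == 0:
        return None  # undefined when nothing is correct
    return total_tokens / n_correct


def usd_per_100_correct(
    prompt_tokens: int,
    completion_tokens: int,
    n_correct: int,
    usd_per_1k_prompt: float,       # placeholder price, set per model
    usd_per_1k_completion: float,   # placeholder price, set per model
) -> Optional[float]:
    """USD spent per 100 correct answers, pricing prompt and completion separately."""
    if n_correct == 0:
        return None
    cost = (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
    return 100 * cost / n_correct
```

For example, 6,000 total tokens with 8 correct answers gives 750 tokens per correct. For a two-call method, the token totals passed in should sum both calls.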
Dataset Selection
The study utilizes publicly available multiple-choice question (MCQ) datasets spanning diverse domains:
Logic
Includes LSAT-LR (Logical Reasoning) and similar datasets that test analytical reasoning and argument evaluation
Science
Questions covering various scientific disciplines requiring factual knowledge and conceptual understanding
Mathematics
Problems testing mathematical reasoning, computation, and problem-solving skills
Truthfulness
Questions assessing the model's ability to identify factual accuracy and detect misinformation
Each dataset is split into train, development, and test sets. Only the development set influences GEPA optimization, while the test set remains unseen until final evaluation to ensure unbiased assessment.
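One way to get reproducible splits is a seeded shuffle, sketched below. The 60/20/20 proportions, the seed, and the `split_dataset` name are assumptions; the source does not state the split sizes.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence,
    seed: int = 0,
    train_frac: float = 0.6,  # assumed proportion
    dev_frac: float = 0.2,    # assumed proportion; the remainder is the test set
) -> Tuple[List, List, List]:
    """Deterministically shuffle and split into (train, dev, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

Fixing the seed keeps the test set identical across runs, so it can stay untouched until final evaluation while only the dev split feeds prompt selection.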
Key Results
Across 40+ experimental runs, several key patterns emerged:
Naïve GEPA Issues
Unoptimized GEPA edits often degraded accuracy and increased token usage through verbose reasoning and format violations.
Self-Refine Performance
SR achieved up to 100% accuracy on easier datasets but at high cost (~600 tokens per question).
Minimal Scaffolding Success
Two short reasoning lines plus the answer often outperformed elaborate prompts, matching accuracy with fewer tokens.
Hybrid Improvements
SR→GEPA with confidence ≥0.85 improved LSAT-LR accuracy from 0.1 to 0.4 while reducing tokens per correct from 2794 to 857.
The study extends prior DSPy evaluations by demonstrating that minimal scaffolding with strict format rules often outperforms verbose prompts when optimizing for tokens-per-correct in tutor-style MCQs.
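As a worked check on the LSAT-LR figures, assume tokens per correct is total tokens divided by the number of correct answers, and that both reported values were measured over the same question set. The implied average spend per question then follows directly:

```latex
\[
\text{tokens per correct}
  = \frac{\text{total tokens}}{\text{number correct}}
  = \frac{\text{average tokens per question}}{\text{accuracy}}
\]
\[
\text{average tokens per question} = \text{accuracy} \times \text{tokens per correct}:
\qquad 0.1 \times 2794 \approx 279, \qquad 0.4 \times 857 \approx 343
\]
```

Under those assumptions the hybrid spends somewhat more per question (about 343 versus 279 tokens), but because accuracy quadruples, the cost per correct answer drops by roughly 69%.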
Discussion and Insights
Key Takeaways
The research reveals several important insights for AI tutor prompt optimization:
  • More rules aren't always better: over-complex prompts can confuse the model and increase costs
  • Self-Refine remains strong for accuracy but at higher expense
  • GEPA shines as an efficiency tuner when optimizing for fewer tokens at constant accuracy
  • Hybrid SR→GEPA with strict confidence thresholds prevents incorrect overrides
Future Research Directions
Several promising avenues for future exploration include:
  • Context-aware GEPA prompts tailored to specific dataset domains
  • Dynamic token budgets to avoid unnecessary reflections
  • Enhanced auditors that cross-check stem/choice polarity
  • Evidence consistency verification mechanisms
Optimizer Decision Matrix
The decision matrix helps practitioners select the optimal approach based on their specific requirements. For applications requiring strict schema/JSON compliance, MIPROv2 is recommended as a strong default. For tutor-style reasoning under cost constraints, GEPA's reflective edits often provide the best balance of accuracy and efficiency.
Production Implementation Tips
1. Comprehensive Logging
Log all traces and evaluations (using MLflow or equivalent) to ensure reproducibility and enable drift detection over time.
2. Regression Testing
Maintain a frozen evaluation set with fixed seed for consistent regression testing across model or prompt updates.
3. Format Enforcement
Implement strict format compliance checks in code (using regex) and adopt a fail-closed approach to violations.
4. Cost Tracking
Monitor tokens_per_correct and cost_per_100_correct as first-class KPIs alongside accuracy metrics.
5. Result Storage
Store per-run CSVs containing accuracy, token usage, and format violations in a /results/ directory with README links.
These details are small, but they make AI tutor systems easier to maintain and more reliable in production.
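For the result-storage tip, a per-run CSV row can be appended with the standard library alone. The column names, the `results/runs.csv` default path, and the `append_result` helper are hypothetical; they only illustrate the kind of record described above.

```python
import csv
from pathlib import Path

# Assumed column layout: one row per experimental run.
FIELDS = [
    "run_id",
    "method",
    "dataset",
    "accuracy",
    "avg_tokens_per_question",
    "tokens_per_correct",
    "format_violations",
]


def append_result(row: dict, path: str = "results/runs.csv") -> None:
    """Append one run's metrics, writing the header if the file is new."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    is_new = not out.exists()
    with out.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Keeping tokens_per_correct and format_violations in the same row as accuracy makes it harder to report one without the others.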
Related Work and Context
DSPy
A framework for declarative LLM programs with pluggable optimizers like BootstrapFewShot and MIPROv2. This research fits the same slot but with a different objective: accuracy under cost with format guarantees.
Self-Refine
Two-call critique/revise approach that delivers strong accuracy but at higher token cost and increased latency compared to single-call methods.
GEPA
Reflective, Pareto-guided prompt evolution; this study adapts its reflect-mutate-select pattern specifically for tutor-style MCQs with strict output format requirements.
This work builds on all three approaches, focusing on tutor-style multiple-choice questions, where accuracy and cost efficiency both matter.
Acknowledgments and Reproducibility
Acknowledgments
Thanks to Stanislav Huseletov for the pragmatic DSPy framing that this work builds upon, and to coworkers who shared early feedback on rubric design, auditing rules, and the evaluation harness.
Repository Access
All code, data, and results are available in the public repository:
Reproducibility Resources
  • Runs & artifacts: CSVs for all approaches including accuracy, tokens per question, tokens per correct, and format violations
  • Environment details: Model names, temperatures, max tokens, seeds, and split definitions
  • Configuration: Complete setup parameters in the repo's configs and logs for each experimental run