Production-ready evaluation templates for Data Agent, AI ask-data, AI fetch-data and NL2SQL workflows.
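This repository addresses one very specific problem:
An AI ask-data demo is easy to build, but how should it be tested before it goes live?
Many teams only verify whether the model can generate SQL. That is not enough. What actually needs to be verified in production is:
- Whether the metric definition is correct
- Whether the SQL is executable
- Whether events are attributed to the correct time period
- Whether order count, user count, and GMV are computed at the correct grain
- Whether sensitive fields are blocked
- Whether queries on large tables include a partition filter
- Whether a change to the prompt, skill, or model causes a regression
- Whether bad cases can be attributed to a cause and fixed
data-agent-eval-kit provides a set of templates that you can adapt and put to use directly.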
data-agent-eval-kit
├── templates
│   ├── gold_case.schema.yaml
│   ├── bad_case_taxonomy.yaml
│   └── judge_rubric.yaml
├── examples
│   ├── trade_gold_cases.yaml
│   └── baseline_sql
├── rules
│   └── sql_review_rules.yaml
├── evaluator
│   ├── evaluate_cases.py
│   └── requirements.txt
├── reports
│   └── sample_eval_report.md
└── docs
    ├── HOW_TO_BUILD_DATA_AGENT_EVALS.md
    └── PRODUCTION_READINESS_CHECKLIST.md
The evaluation is not a single score.
It is a production gate:
User question
-> intent / metric / time / dimension extraction
-> schema and metric matching
-> SQL generation
-> SQL review
-> baseline comparison
-> permission check
-> bad case attribution
-> regression report
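To make the "gate, not score" idea concrete, here is a minimal sketch of how per-case results could be turned into a release decision. It is an illustration only, not code from this repository: `CaseResult`, `gate`, and the rule that any failing P0 case blocks a release are assumptions made for the example. The pass rate is reported, but it does not decide the outcome.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    """Outcome of one gold case (hypothetical structure for this sketch)."""
    case_id: str
    priority: str                              # e.g. "P0", "P1"
    error_tags: list[str] = field(default_factory=list)


def gate(results: list[CaseResult], blocking_priorities=("P0",)) -> dict:
    """Block the release if any case with a blocking priority has errors.

    Assumption: P0 failures block; lower priorities are reported only.
    """
    failed = [r for r in results if r.error_tags]
    blocking = [r.case_id for r in failed if r.priority in blocking_priorities]
    pass_rate = (len(results) - len(failed)) / len(results) if results else 0.0
    return {
        "release_allowed": not blocking,
        "blocking_cases": blocking,
        "failed_cases": [r.case_id for r in failed],
        "pass_rate": pass_rate,
    }


if __name__ == "__main__":
    report = gate([
        CaseResult("TRADE_GMV_001", "P0", ["metric_error"]),
        CaseResult("TRADE_ORDER_002", "P1"),   # hypothetical second case
    ])
    print(report)
```

With this shape, a run where 99 of 100 cases pass is still blocked if the one failure is a P0 case, which is the behaviour a single averaged score would hide.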
case_id: TRADE_GMV_001
task_type: ask_data
priority: P0
question: What was the paid GMV on the platform yesterday?
business_context: E-commerce trade domain; sums the amounts of successfully paid item lines.
required_metrics:
  - pay_gmv
time_policy:
  field: pay_success_date
  range: yesterday
must_use:
  - sum(sku_pay_amount)
  - dt = '${yesterday}'
must_not_use:
  - sum(order_amount)
  - create_time
baseline_sql_file: examples/baseline_sql/TRADE_GMV_001.sql
expected_checks:
  - metric_consistency
  - time_policy_correctness
  - partition_filter_required
error_tags:
  - metric_error
  - time_policy_error
  - partition_missing

cd evaluator
pip install -r requirements.txt
python evaluate_cases.py \
  --cases ../examples/trade_gold_cases.yaml \
  --sql-dir ../examples/generated_sql \
  --rules ../rules/sql_review_rules.yaml

The evaluator is intentionally simple. It does not replace your real SQL engine or business validation. It helps you run a first-pass gate:
- required SQL fragments
- forbidden SQL fragments
- sensitive field detection
- partition filter checks
- dangerous SQL operation checks
- case-level report generation
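
The checks listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in evaluate_cases.py: the function `review_sql`, the keyword and sensitive-field lists, the `dt` partition column, and the error tag strings other than `partition_missing` are all hypothetical. The `must_use` and `must_not_use` keys mirror the gold case format.

```python
import re

# Hypothetical policy lists; replace them with your own rules.
DANGEROUS_KEYWORDS = ("drop", "truncate", "delete", "update", "insert", "alter")
SENSITIVE_FIELDS = ("phone", "id_card")


def normalize(sql: str) -> str:
    """Lowercase and strip all whitespace so fragment matching ignores formatting."""
    return re.sub(r"\s+", "", sql.lower())


def has_word(sql: str, words) -> bool:
    """True if any of the words appears as a whole identifier or keyword."""
    pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, sql, flags=re.IGNORECASE) is not None


def review_sql(sql: str, case: dict, partition_column: str = "dt") -> list[str]:
    """Return error tags for one generated SQL; an empty list means it passed."""
    errors = []
    flat = normalize(sql)

    # Required and forbidden fragments are compared as literal text,
    # so placeholders such as ${yesterday} must match literally too.
    for fragment in case.get("must_use", []):
        if normalize(fragment) not in flat:
            errors.append(f"missing_required:{fragment}")
    for fragment in case.get("must_not_use", []):
        if normalize(fragment) in flat:
            errors.append(f"forbidden_used:{fragment}")

    # Partition check: the partition column must be compared somewhere after WHERE.
    partition_pattern = (
        r"\bwhere\b.*\b" + re.escape(partition_column)
        + r"\s*(=|>=|<=|>|<|between\b|in\b)"
    )
    if not re.search(partition_pattern, sql, flags=re.IGNORECASE | re.DOTALL):
        errors.append("partition_missing")

    if has_word(sql, SENSITIVE_FIELDS):
        errors.append("sensitive_field")
    if has_word(sql, DANGEROUS_KEYWORDS):
        errors.append("dangerous_operation")
    return errors


if __name__ == "__main__":
    case = {
        "must_use": ["sum(sku_pay_amount)", "dt = '${yesterday}'"],
        "must_not_use": ["sum(order_amount)", "create_time"],
    }
    sql = "SELECT SUM(order_amount) FROM trade_order WHERE create_time >= '2024-01-01'"
    for tag in review_sql(sql, case):
        print(tag)
```

Text matching of this kind is cheap and catches obvious regressions, but it cannot tell whether a fragment appears in the right clause or subquery. Treat it as a filter in front of the baseline comparison, not a substitute for it.
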
Who this is for:
- Data warehouse engineers building AI ask-data systems
- Data platform teams building an internal Data Agent
- Analytics engineers evaluating NL2SQL
- AI product teams building ChatBI / Text2SQL
- Teams trying to move from demo to production
AI ask-data fails silently.
The dangerous case is not when SQL fails to execute. The dangerous case is when SQL executes successfully and returns a wrong number.
This repo is built around one principle:
A Data Agent should not be trusted just because it sounds fluent. It should be trusted only after it survives real business cases.
Roadmap:
- Add more business domains: user growth, inventory, marketing, finance
- Add report-generation evaluation cases
- Add Langfuse / LangSmith trace examples
- Add dbt semantic layer examples
- Add MCP tool evaluation templates
- Add CI workflow for regression checks