data-agent-eval-kit

Production-ready evaluation templates for Data Agent, AI ask-data, AI fetch-data and NL2SQL workflows.

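This repository addresses one specific problem:

An AI ask-data demo is easy to build, but how should it be tested before it goes live?

Many teams only verify whether the model can generate SQL. That is not enough. What production actually needs to verify is:

Whether metric definitions are correct
Whether the SQL is executable
Whether time attribution is correct
Whether order count, user count, and GMV are computed at the correct grain
Whether sensitive fields are blocked
Whether queries on large tables include a partition filter
Whether a change to the prompt, skill, or model causes a regression
Whether bad cases can be attributed and fixed

data-agent-eval-kit provides a set of templates that you can adapt and put into use directly.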

What is inside

data-agent-eval-kit
├── templates
│   ├── gold_case.schema.yaml
│   ├── bad_case_taxonomy.yaml
│   └── judge_rubric.yaml
├── examples
│   ├── trade_gold_cases.yaml
│   └── baseline_sql
├── rules
│   └── sql_review_rules.yaml
├── evaluator
│   ├── evaluate_cases.py
│   └── requirements.txt
├── reports
│   └── sample_eval_report.md
└── docs
    ├── HOW_TO_BUILD_DATA_AGENT_EVALS.md
    └── PRODUCTION_READINESS_CHECKLIST.md

Evaluation framework

The evaluation is not a single score.

It is a production gate:

User question
  -> intent / metric / time / dimension extraction
  -> schema and metric matching
  -> SQL generation
  -> SQL review
  -> baseline comparison
  -> permission check
  -> bad case attribution
  -> regression report
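The flow above can be sketched as a list of named stages that each return error tags, with the gate passing only when every stage is clean. This is a minimal illustration, not the code in evaluator/evaluate_cases.py: `StageResult`, `run_gate`, `gate_passed`, the two placeholder stages, and the table name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StageResult:
    stage: str
    passed: bool
    error_tags: list[str] = field(default_factory=list)


# A stage takes the shared context and returns error tags (empty list = pass).
Check = Callable[[dict], list[str]]


def run_gate(ctx: dict, stages: list[tuple[str, Check]]) -> list[StageResult]:
    """Run every stage and keep every result, so one bad case can carry several tags."""
    results = []
    for name, check in stages:
        tags = check(ctx)
        results.append(StageResult(name, not tags, tags))
    return results


def gate_passed(results: list[StageResult]) -> bool:
    """The gate is pass/fail per case, not an averaged score."""
    return all(r.passed for r in results)


# Hypothetical placeholder stages; real ones would parse SQL and query a warehouse.
stages = [
    ("sql_review",
     lambda ctx: [] if "dt =" in ctx["sql"] else ["partition_missing"]),
    ("baseline_comparison",
     lambda ctx: [] if ctx["result"] == ctx["baseline_result"] else ["metric_error"]),
]

ctx = {
    "sql": "SELECT SUM(order_amount) FROM trade_order WHERE dt = '2024-05-01'",
    "result": 120.0,           # value returned by the generated SQL (made up)
    "baseline_result": 100.0,  # value returned by the baseline SQL (made up)
}

results = run_gate(ctx, stages)
for r in results:
    print(r.stage, "PASS" if r.passed else "FAIL", r.error_tags)
print("gate passed:", gate_passed(results))
```

Keeping per-stage results instead of stopping at the first failure is a design choice here: it lets the final report attribute a bad case to more than one cause.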

Example Gold Case

case_id: TRADE_GMV_001
task_type: ask_data
priority: P0
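question: 昨天平台支付 GMV 是多少？  # What was the platform's paid GMV yesterday?
business_context: 电商交易主题，统计支付成功商品行金额。  # E-commerce trade domain; sums the line-item amounts of successfully paid items.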
required_metrics:
  - pay_gmv
time_policy:
  field: pay_success_date
  range: yesterday
must_use:
  - sum(sku_pay_amount)
  - dt = '${yesterday}'
must_not_use:
  - sum(order_amount)
  - create_time
baseline_sql_file: examples/baseline_sql/TRADE_GMV_001.sql
expected_checks:
  - metric_consistency
  - time_policy_correctness
  - partition_filter_required
error_tags:
  - metric_error
  - time_policy_error
  - partition_missing
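One way the must_use and must_not_use fields of a gold case could be checked against generated SQL is a normalized substring match. The sketch below assumes that approach; the function names, the `params` argument for resolving `${yesterday}`, and the table names in the sample SQL are hypothetical and are not taken from the repository's evaluator.

```python
import re


def normalize(sql: str) -> str:
    """Lowercase and strip all whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", "", sql.lower())


def check_fragments(case: dict, sql: str, params: dict) -> list[str]:
    """Return a list of error messages; an empty list means the case passed."""
    haystack = normalize(sql)
    errors = []
    for frag in case.get("must_use", []):
        resolved = frag
        # Substitute placeholders such as ${yesterday} with concrete values.
        for key, value in params.items():
            resolved = resolved.replace("${" + key + "}", value)
        if normalize(resolved) not in haystack:
            errors.append(f"missing required fragment: {frag}")
    for frag in case.get("must_not_use", []):
        if normalize(frag) in haystack:
            errors.append(f"forbidden fragment present: {frag}")
    return errors


# The gold case fields from the example above, written as a plain dict.
case = {
    "case_id": "TRADE_GMV_001",
    "must_use": ["sum(sku_pay_amount)", "dt = '${yesterday}'"],
    "must_not_use": ["sum(order_amount)", "create_time"],
}
params = {"yesterday": "2024-05-01"}  # made-up date for illustration

# Hypothetical generated SQL; table names are assumptions.
good_sql = """
SELECT SUM(sku_pay_amount) AS pay_gmv
FROM trade_order_item
WHERE dt = '2024-05-01'
"""
bad_sql = "SELECT SUM(order_amount) FROM trade_order WHERE create_time >= '2024-05-01'"

print(check_fragments(case, good_sql, params))
print(check_fragments(case, bad_sql, params))
```

Substring matching is deliberately crude: it cannot tell a column reference from the same text inside a comment or a longer identifier, so it works as a first-pass filter rather than a proof of correctness.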

Quick start

cd evaluator
pip install -r requirements.txt
python evaluate_cases.py \
  --cases ../examples/trade_gold_cases.yaml \
  --sql-dir ../examples/generated_sql \
  --rules ../rules/sql_review_rules.yaml

The evaluator is intentionally simple. It does not replace your real SQL engine or business validation. It helps you run a first-pass gate:

required SQL fragments
forbidden SQL fragments
sensitive field detection
partition filter checks
dangerous SQL operation checks
case-level report generation
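Three of the checks listed above (sensitive fields, partition filters, dangerous operations) can be approximated with token and regex matching. The following is a sketch under assumptions: the `RULES` dict is a hypothetical shape and may not match rules/sql_review_rules.yaml, and apart from `partition_missing` the tag names are invented for the example.

```python
import re

# Hypothetical rule set; the real rules file may be structured differently.
RULES = {
    "sensitive_fields": ["phone", "id_card", "email"],
    "partition_column": "dt",
    "dangerous_keywords": ["drop", "delete", "truncate", "update", "insert", "alter"],
}


def review_sql(sql: str, rules: dict) -> list[str]:
    """Return error tags for one SQL statement; an empty list means no finding."""
    text = sql.lower()
    # Identifier-like tokens; keywords inside string literals will also match.
    tokens = set(re.findall(r"[a-z_][a-z0-9_]*", text))
    tags = []

    if tokens & set(rules["sensitive_fields"]):
        tags.append("sensitive_field")

    # Require the partition column to be compared somewhere after WHERE.
    parts = re.split(r"\bwhere\b", text, maxsplit=1)
    column = re.escape(rules["partition_column"])
    pattern = rf"\b{column}\b\s*(=|>=|<=|>|<|between\b|in\b)"
    if len(parts) < 2 or not re.search(pattern, parts[1]):
        tags.append("partition_missing")

    if tokens & set(rules["dangerous_keywords"]):
        tags.append("dangerous_operation")

    return tags


# Table and column names below are made up for illustration.
print(review_sql(
    "SELECT SUM(sku_pay_amount) FROM trade_order_item WHERE dt = '2024-05-01'", RULES))
print(review_sql("SELECT phone FROM user_profile", RULES))
print(review_sql("DROP TABLE trade_order_item", RULES))
```

A real deployment would more likely rely on a SQL parser and the engine's own permission system; pattern matching like this only catches the obvious cases early.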

Who should use this

Data warehouse engineers building AI ask-data systems
Data platform teams building an internal Data Agent
Analytics engineers evaluating NL2SQL
AI product teams building ChatBI / Text2SQL
Teams trying to move from demo to production

Why this matters

AI ask-data fails silently.

The dangerous case is not when SQL fails to execute. The dangerous case is when SQL executes successfully and returns a wrong number.

This repo is built around one principle:

A Data Agent should not be trusted because it sounds fluent. It should be trusted only after it survives real business cases.
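To make the silent failure concrete, the sketch below runs a baseline query and a plausible but wrong generated query against a tiny in-memory SQLite table. Both execute without error, yet only a comparison against the baseline reveals the difference. The table, its rows, and the helper `scalar` are made up for illustration; a real baseline comparison would run against your own warehouse.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per order line item; order_amount repeats on every line of the same order.
conn.executescript("""
CREATE TABLE order_item (
    order_id INTEGER, order_amount REAL, sku_pay_amount REAL, dt TEXT
);
INSERT INTO order_item VALUES
    (1, 100.0, 60.0, '2024-05-01'),
    (1, 100.0, 40.0, '2024-05-01'),
    (2,  50.0, 50.0, '2024-05-01');
""")


def scalar(sql: str) -> float:
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Baseline: sum the paid amount at line-item grain.
baseline = scalar(
    "SELECT SUM(sku_pay_amount) FROM order_item WHERE dt = '2024-05-01'")
# Generated: sums the order-level amount once per line item, double counting order 1.
generated = scalar(
    "SELECT SUM(order_amount) FROM order_item WHERE dt = '2024-05-01'")

results_match = math.isclose(baseline, generated)
print(baseline, generated, results_match)
```

Neither query raises an error, which is why an executability check alone would let the wrong number through.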

Roadmap

Add more business domains: user growth, inventory, marketing, finance
Add report-generation evaluation cases
Add Langfuse / LangSmith trace examples
Add dbt semantic layer examples
Add MCP tool evaluation templates
Add CI workflow for regression checks
