SevenDataAI/data-agent-eval-kit

data-agent-eval-kit

Production-ready evaluation templates for Data Agent, AI ask-data, AI fetch-data and NL2SQL workflows.

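This repository addresses one very specific question:

An AI ask-data demo is easy to build, but how should it be tested before it goes live?

Many teams only verify whether the model can generate SQL. That is not enough. What a production environment actually needs to verify is:

  • Whether metric definitions are correct
  • Whether the SQL is executable
  • Whether time attribution is correct
  • Whether order count, user count, and GMV are computed at the correct grain
  • Whether sensitive fields are blocked
  • Whether queries on large tables include a partition filter
  • Whether regressions occur after Prompt / Skill / model changes
  • Whether a Bad Case can be attributed and fixed

data-agent-eval-kit provides a set of templates that you can adapt and put into use directly.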
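
As an illustration of the regression item in the list above, the following is a minimal sketch of diffing two evaluation runs after a Prompt / Skill / model change. It is not taken from the repository's evaluator: the function name `find_regressions`, the result format (case ID mapped to pass/fail), and every case ID other than `TRADE_GMV_001` are assumptions made for the example.

```python
from typing import Dict, List


def find_regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return case IDs that passed in the previous run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and not current.get(case_id, False)
    )


# Hypothetical results from two runs: before and after a prompt change.
previous_run = {"TRADE_GMV_001": True, "TRADE_GMV_002": True, "TRADE_ORDER_001": False}
current_run = {"TRADE_GMV_001": True, "TRADE_GMV_002": False, "TRADE_ORDER_001": True}

regressions = find_regressions(previous_run, current_run)
print(regressions)  # ['TRADE_GMV_002']
```

A case that newly passes (`TRADE_ORDER_001` here) is not a regression; only cases that go from pass to fail are reported, which is the signal that should block a release.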

What is inside

data-agent-eval-kit
├── templates
│   ├── gold_case.schema.yaml
│   ├── bad_case_taxonomy.yaml
│   └── judge_rubric.yaml
├── examples
│   ├── trade_gold_cases.yaml
│   └── baseline_sql
├── rules
│   └── sql_review_rules.yaml
├── evaluator
│   ├── evaluate_cases.py
│   └── requirements.txt
├── reports
│   └── sample_eval_report.md
└── docs
    ├── HOW_TO_BUILD_DATA_AGENT_EVALS.md
    └── PRODUCTION_READINESS_CHECKLIST.md

Evaluation framework

The evaluation is not a single score.

It is a production gate:

User question
  -> intent / metric / time / dimension extraction
  -> schema and metric matching
  -> SQL generation
  -> SQL review
  -> baseline comparison
  -> permission check
  -> bad case attribution
  -> regression report
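
The gate idea above can be sketched as a list of named stages, each returning error tags, where a case passes only if every stage returns none. This is a simplified illustration, not the repository's implementation: `run_gate`, the two toy stages, and the tags `dangerous_operation` and `sensitive_field` are hypothetical, and treating `phone` as a sensitive column is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# A stage inspects the generated SQL and returns error tags (empty list = pass).
Stage = Callable[[str], List[str]]


def run_gate(sql: str, stages: List[Tuple[str, Stage]]) -> Dict[str, object]:
    """Run every stage and collect failures so a bad case can be attributed."""
    failures: Dict[str, List[str]] = {}
    for name, check in stages:
        tags = check(sql)
        if tags:
            failures[name] = tags
    return {"passed": not failures, "failures": failures}


def sql_review(sql: str) -> List[str]:
    # Toy rule: only read-only statements are allowed through.
    words = sql.split()
    if not words or words[0].lower() not in ("select", "with"):
        return ["dangerous_operation"]
    return []


def permission_check(sql: str) -> List[str]:
    # Toy rule: assume "phone" is a sensitive column.
    return ["sensitive_field"] if "phone" in sql.lower() else []


STAGES = [("sql_review", sql_review), ("permission_check", permission_check)]

report = run_gate("SELECT phone FROM users", STAGES)
print(report)
# {'passed': False, 'failures': {'permission_check': ['sensitive_field']}}
```

Collecting failures from all stages, instead of stopping at the first one, keeps the per-stage tags available for bad case attribution.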

Example Gold Case

case_id: TRADE_GMV_001
task_type: ask_data
priority: P0
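# question (English): What was the platform's paid GMV yesterday?
question: 昨天平台支付 GMV 是多少?
# business_context (English): E-commerce trade domain; sums the amounts of successfully paid SKU line items.
business_context: 电商交易主题,统计支付成功商品行金额。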
required_metrics:
  - pay_gmv
time_policy:
  field: pay_success_date
  range: yesterday
must_use:
  - sum(sku_pay_amount)
  - dt = '${yesterday}'
must_not_use:
  - sum(order_amount)
  - create_time
baseline_sql_file: examples/baseline_sql/TRADE_GMV_001.sql
expected_checks:
  - metric_consistency
  - time_policy_correctness
  - partition_filter_required
error_tags:
  - metric_error
  - time_policy_error
  - partition_missing
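
To show how the `must_use` and `must_not_use` fields of this gold case might be applied to a generated query, here is a minimal sketch. It assumes plain substring matching after lowercasing and whitespace normalization, and that `${yesterday}` is substituted before matching; the actual logic in `evaluate_cases.py` may differ. The helper names and the table name `dwd_trade_order_sku` are hypothetical.

```python
import re
from typing import Dict, List


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return re.sub(r"\s+", " ", sql.strip().lower())


def check_fragments(case: Dict, sql: str, variables: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the fragment checks pass."""
    text = normalize(sql)
    problems = []
    for fragment in case.get("must_use", []):
        # Fill placeholders such as ${yesterday} before matching.
        for key, value in variables.items():
            fragment = fragment.replace("${" + key + "}", value)
        if normalize(fragment) not in text:
            problems.append(f"missing required fragment: {fragment}")
    for fragment in case.get("must_not_use", []):
        if normalize(fragment) in text:
            problems.append(f"forbidden fragment present: {fragment}")
    return problems


# Fields copied from the TRADE_GMV_001 gold case above.
case = {
    "must_use": ["sum(sku_pay_amount)", "dt = '${yesterday}'"],
    "must_not_use": ["sum(order_amount)", "create_time"],
}
variables = {"yesterday": "2024-05-01"}

good_sql = """
SELECT SUM(sku_pay_amount) AS pay_gmv
FROM dwd_trade_order_sku          -- hypothetical table name
WHERE dt = '2024-05-01'
"""
bad_sql = "SELECT SUM(order_amount) FROM orders WHERE create_time >= '2024-05-01'"

print(check_fragments(case, good_sql, variables))       # []
print(len(check_fragments(case, bad_sql, variables)))   # 4
```

Substring matching is deliberately crude: a query written as `dt='2024-05-01'` (no spaces) would be reported as missing the partition fragment, so a real gate would likely parse the SQL or compare results against the baseline as well.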

Quick start

cd evaluator
pip install -r requirements.txt
python evaluate_cases.py \
  --cases ../examples/trade_gold_cases.yaml \
  --sql-dir ../examples/generated_sql \
  --rules ../rules/sql_review_rules.yaml

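`--sql-dir` should point to the directory that holds your agent's generated SQL. `examples/generated_sql` is not listed in the tree above, so you may need to create it.

The evaluator is intentionally simple. It does not replace your real SQL engine or business validation. It runs a first-pass gate that covers: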

  • required SQL fragments
  • forbidden SQL fragments
  • sensitive field detection
  • partition filter checks
  • dangerous SQL operation checks
  • case-level report generation
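
Three of the checks listed above (sensitive fields, partition filters, dangerous operations) can be approximated with regular expressions, as in the sketch below. The `RULES` dictionary is a hypothetical stand-in: the real `rules/sql_review_rules.yaml` may be structured differently, and the column, keyword, and table names here are assumptions.

```python
import re
from typing import Dict, List

# Hypothetical rule set; not the contents of rules/sql_review_rules.yaml.
RULES = {
    "sensitive_fields": ["phone", "id_card", "email"],
    "dangerous_keywords": ["drop", "delete", "truncate", "update", "insert", "alter"],
    # Large table -> partition column that must be filtered.
    "partitioned_tables": {"dwd_trade_order_sku": "dt"},
}


def review_sql(sql: str, rules: Dict = RULES) -> List[str]:
    """Return findings such as 'sensitive_field:phone'; empty list means clean."""
    text = sql.lower()
    findings = []
    for column in rules["sensitive_fields"]:
        if re.search(rf"\b{re.escape(column)}\b", text):
            findings.append(f"sensitive_field:{column}")
    for keyword in rules["dangerous_keywords"]:
        # \b keeps column names like update_time from matching "update".
        if re.search(rf"\b{keyword}\b", text):
            findings.append(f"dangerous_operation:{keyword}")
    for table, part_col in rules["partitioned_tables"].items():
        uses_table = re.search(rf"\b{re.escape(table)}\b", text)
        has_filter = re.search(
            rf"\b{re.escape(part_col)}\s*(?:=|>|<|between\b|in\b)", text
        )
        if uses_table and not has_filter:
            findings.append(f"partition_missing:{table}")
    return findings


print(review_sql("SELECT phone FROM dwd_trade_order_sku"))
# ['sensitive_field:phone', 'partition_missing:dwd_trade_order_sku']
```

Regex checks like these can produce false positives (for example, a keyword inside a string literal) and do not confirm that the partition predicate sits in the `WHERE` clause, which is in line with treating this as a first-pass gate only.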

Who should use this

  • Data warehouse engineers building AI ask-data systems
  • Data platform teams building internal Data Agents
  • Analytics engineers evaluating NL2SQL
  • AI product teams building ChatBI / Text2SQL
  • Teams trying to move from demo to production

Why this matters

AI ask-data fails silently.

The dangerous case is not when SQL fails to execute. The dangerous case is when SQL executes successfully and returns a wrong number.

This repo is built around one principle:

A Data Agent should not be trusted because it sounds fluent. It should be trusted only after it survives real business cases.
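
The silent failure described above can be reproduced with a small worked example. The sketch below uses an in-memory SQLite table as a stand-in for a real warehouse; the table `order_sku`, its rows, and the helper names are invented for illustration. Both queries execute successfully, but only the baseline computes GMV at the SKU line grain that the gold case requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_sku ("
    "order_id INTEGER, sku_id INTEGER, order_amount REAL, sku_pay_amount REAL, dt TEXT)"
)
# Order 1 has two SKU rows, so its order_amount (100) is repeated on each row.
conn.executemany(
    "INSERT INTO order_sku VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, 100.0, 30.0, "2024-05-01"),
        (1, 102, 100.0, 70.0, "2024-05-01"),
        (2, 201, 50.0, 50.0, "2024-05-01"),
    ],
)

BASELINE_SQL = "SELECT SUM(sku_pay_amount) FROM order_sku WHERE dt = '2024-05-01'"
GENERATED_SQL = "SELECT SUM(order_amount) FROM order_sku WHERE dt = '2024-05-01'"


def scalar(sql: str) -> float:
    """Run a query that returns a single numeric value."""
    return conn.execute(sql).fetchone()[0]


def matches_baseline(generated: str, baseline: str, tol: float = 1e-6) -> bool:
    """Compare the generated query's result with the baseline result."""
    return abs(scalar(generated) - scalar(baseline)) <= tol


baseline_value = scalar(BASELINE_SQL)    # 150.0
generated_value = scalar(GENERATED_SQL)  # 250.0, inflated by the repeated order_amount
print(matches_baseline(GENERATED_SQL, BASELINE_SQL))  # False
```

Neither query raises an error, so an "is the SQL executable" check alone would pass both. Only the comparison against the baseline exposes the inflated number.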

Roadmap

  • Add more business domains: user growth, inventory, marketing, finance
  • Add report-generation evaluation cases
  • Add Langfuse / LangSmith trace examples
  • Add dbt semantic layer examples
  • Add MCP tool evaluation templates
  • Add CI workflow for regression checks
