MetalFish is a UCI chess engine built around a hybrid search: CPU alpha-beta with NNUE plus GPU MCTS with an Lc0-compatible transformer. The mature path is Apple Silicon with Metal/MPSGraph, and the CUDA branch brings the same shared NN/search contracts to NVIDIA Linux and Windows hosts.
The CPU alpha-beta + NNUE search (the engine's strength) runs on every platform, and the prebuilt binaries embed the NNUE. The transformer backend differs by platform:
| Platform | Prebuilt binary | Transformer backend |
|---|---|---|
| macOS Apple Silicon (arm64) | yes | Metal/MPSGraph |
| Linux x86-64 | yes | CPU (CUDA from source) |
| Linux arm64 | yes | CPU |
| Windows x86-64 | yes | CPU (CUDA from source) |
| Linux/Windows x86-64 + NVIDIA | from source (-DUSE_CUDA=ON) |
CUDA, parity-gated in CI/GCP |
MetalFish 1.0.0 is a working UCI engine, not a Stockfish replacement claim. Its playing strength comes from the CPU alpha-beta search with a Stockfish-family NNUE evaluation; the standalone MCTS path is an Lc0-compatible transformer search that loads BT4 weights.
The project's research contribution is the hybrid runtime: CPU NNUE alpha-beta and GPU transformer MCTS search the same position concurrently, exchange root-level signals, and a coordinator selects the final move. In practice the coordinator defers to the alpha-beta move when the GPU MCTS is not clearly better, so the hybrid's strength tracks its alpha-beta component — the GPU side is a research/diagnostic path, not a strength multiplier.
| Mode | UCI option | Purpose |
|---|---|---|
| Hybrid | UseHybridSearch true |
Primary engine for play |
| Alpha-Beta | UseHybridSearch false, UseMCTS false |
CPU fallback and diagnostics |
| MCTS | UseMCTS true |
Transformer search diagnostics |
Hybrid runs AB and MCTS in parallel. AB uses CPU NNUE and the transposition
table; MCTS uses the BT4 transformer through NNBackend: Metal/MPSGraph on
Apple Silicon, CUDA on supported NVIDIA Linux/Windows hosts, and portable CPU
only as a fallback/correctness path. The two searches share root signals and
the coordinator chooses the final move from both engines.
Benchmark results in this repository must use a fair, strength-first resource policy:
- One shared worker-thread budget per engine.
- On Apple Silicon,
autouses the performance-core count. - CPU engines receive the same
ThreadsandHashsettings where supported. - Pure MetalFish MCTS keeps its engine-recommended Apple worker cap unless a
test is explicitly measuring throughput scaling. For the current BT4/MPSGraph
backend this is one search worker with direct evals;
MCTSParallelSearchmust be enabled explicitly for pure-MCTS throughput experiments. - Lc0 receives the requested
Threadsvalue for its own backend. - MetalFish Hybrid receives the same total worker budget split between MCTS and AB workers.
- Tactical-suite scores are reported as tactical accuracy only, not Elo.
Run the fair Bratko-Kopec tactical sweep on the current machine:
python3 tests/paper_benchmarks.py --tactical --movetime 5000 \
--threads auto --hash auto \
--engines metalfish-ab,metalfish-mcts,metalfish-hybrid,lc0,stockfishThe script writes machine-readable results to results/paper_tactical.json and
a report to results/paper_summary.md. Any headline strength claims should be
based on those generated artifacts or on a larger cutechess tournament, not on
stale README tables.
For noisy tactical changes, add repeats and compare aggregate run counts:
python3 tests/paper_benchmarks.py --tactical --movetime 5000 \
--threads auto --hash auto --tactical-repeat 3For broad tactical regression testing, use an offline sample from the public
Lichess puzzle CSV instead of the 24-position smoke suite. This avoids API
quota and lets CI compare the same positions on main and the PR branch:
curl -L https://database.lichess.org/lichess_db_puzzle.csv.zst \
| zstd -dc \
| python3 tools/filter_lichess_puzzle_csv.py \
--out results/lichess_puzzles/sample.csv \
--max-puzzles 1000 --min-puzzles 1000 \
--min-rating 1200 --max-rating 2800 --min-popularity 70 \
--themes "advantage,mate,fork,sacrifice,skewer,pin,discoveredAttack"
python3 tools/lichess_puzzle_runner.py \
--offline-csv results/lichess_puzzles/sample.csv \
--engine build/metalfish \
--mode hybrid \
--weights networks/BT4-1024x15x32h-swa-6147500.pb \
--threads auto --hash-mb 4096 --movetime-ms 1000The 24-position Bratko-Kopec suite is a small tactical smoke test, useful for regression tracking but far too small and noisy to rank engines or estimate Elo. On an M2 Max at 5s/position the MetalFish configurations and references all land in the low-20s out of 24 — at the resolution limit of a 24-position suite. Two points matter for an honest reading:
- The hybrid's tactical score tracks its alpha-beta + NNUE component. On the
committed-config run the GPU transformer performed zero evaluations on every
solved position (
nn_evals=0inresults/paper_tactical.json), i.e. the result came from the CPU search, not the GPU MCTS. Hybrid-vs-AB differences on this suite are not GPU strength. - Per-position differences at this sample size are dominated by worker-count and run-to-run variance, not engine strength.
For any actual strength claim, use a large cutechess tournament with the fair resource policy below — never this suite.
A forced full-worker pure-MCTS stress run (MCTSMaxThreads=8) is not a
strength profile for the current backend. It previously scored 12/24 and timed
out on BK.24, which is why pure MCTS now caps Apple workers unless
MCTSParallelSearch=true is set.
MetalFish's pure-MCTS strength profile intentionally uses a small
MCTSMinimumKLDGainPerNode=0.00005 tactical stopper and the default
PureMCTSSmartPruningFactor=0.5. This keeps pure MCTS from freezing low-root
alternatives too early while leaving Hybrid's MCTS worker on its separate
hybrid profile. For an exact Lc0-style KLD-off diagnostic
comparison, run:
python3 tests/bk_parity.py --engine both --movetime 5000 \
--threads 8 --mcts-kld 0.0This is a tactical-suite result only. It is useful for regression tracking, but it is not an Elo estimate.
For game-strength testing, prefer cutechess matches with the same resource policy:
THREADS=8 HASH=4096 ./tools/run_cutechess_tournament.sh \
--games=20 --tc=300+0.1 --concurrency=1- Shared NN execution-plan, tensor-layout, output-decoding, and policy-map contracts across Metal and CUDA
- Metal/MPSGraph transformer inference
- CUDA transformer inference on NVIDIA Linux/Windows hosts
- Unified-memory CPU/GPU search architecture
- Accelerate/vDSP policy softmax
- NEON and ARM dot-product NNUE kernels
- 128-byte aligned MCTS nodes for Apple Silicon cache lines
- Native Apple CPU tuning and Release LTO by default
- Local Polyglot opening books for the Lichess bot
- Syzygy 3-4-5 tablebase support
- Experimental Core ML/ANE root-probe sidecar, disabled by default
Common:
- CMake 3.20 or newer
- C++20 compiler
- Python 3.11+ for bot and benchmark tooling
macOS Metal:
- Apple Silicon Mac, arm64
- macOS 13 or newer
- Xcode Command Line Tools
- Homebrew packages:
protobuf zlib abseil
brew install cmake protobuf zlib abseil
python3 -m pip install -r tests/requirements.txtLinux CUDA:
- NVIDIA GPU with a supported CUDA toolkit
- CUDA Toolkit 12.x, cuBLAS, protobuf, zlib, and Abseil development packages
Windows CUDA:
- Windows x86-64 with Visual Studio 2022/MSVC
- CUDA Toolkit 12.x
- vcpkg or the repository CI scripts for protobuf/zlib/Abseil dependencies
MetalFish expects network files under networks/.
mkdir -p networks
python3 tools/download_engine_networks.py --dest networksManual download equivalent:
curl -L -o networks/nn-c288c895ea92.nnue \
https://tests.stockfishchess.org/api/nn/nn-c288c895ea92.nnue
curl -L -o networks/nn-37f18f62d772.nnue \
https://tests.stockfishchess.org/api/nn/nn-37f18f62d772.nnue
curl -L -o networks/BT4-1024x15x32h-swa-6147500.pb.gz \
https://storage.lczero.org/files/networks-contrib/big-transformers/BT4-1024x15x32h-swa-6147500.pb.gz
gzip -dc networks/BT4-1024x15x32h-swa-6147500.pb.gz \
> networks/BT4-1024x15x32h-swa-6147500.pbEach tagged release publishes binaries for macOS, Linux, and Windows. The Stockfish NNUE is embedded in the binary, so alpha-beta play works with no downloads. Only the hybrid/MCTS transformer modes need the BT4 network.
| Platform | Asset |
|---|---|
| macOS Apple Silicon | metalfish-macos-arm64.tar.gz |
| Linux x86-64 | metalfish-linux-x86_64.tar.gz |
| Linux arm64 | metalfish-linux-arm64.tar.gz |
| Windows x86-64 | metalfish-windows-x86_64.zip |
# macOS / Linux: extract, then run
tar -xzf metalfish-<platform>.tar.gz && cd metalfish
./metalfish # alpha-beta works immediately (NNUE embedded)
# For hybrid/MCTS, fetch the BT4 transformer (bundled helper, Python 3, no deps):
python3 download_engine_networks.py --dest networks
# then over UCI:
# setoption name UseHybridSearch value true
# setoption name NNWeights value networks/BT4-1024x15x32h-swa-6147500.pbOn Windows, extract the .zip and run metalfish.exe the same way. Verify
downloads against the release SHA256SUMS.txt. NVIDIA CUDA builds are produced
from source with -DUSE_CUDA=ON (see Build, below).
macOS Metal:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_METAL=ON
cmake --build build --target metalfish -j"$(sysctl -n hw.ncpu)"Linux CUDA:
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_CUDA=ON
cmake --build build --target metalfish --parallelWindows CUDA builds are validated through the MSVC/NVCC gate scripts and
GitHub workflow. For release-candidate packages, use the CUDA compile/runtime
gates documented in docs/cross_platform_backend_plan.md.
./build/metalfishRecommended UCI setup:
setoption name UseHybridSearch value true
setoption name NNWeights value networks/BT4-1024x15x32h-swa-6147500.pb
setoption name Threads value 8
setoption name Hash value 4096
setoption name Ponder value true
setoption name HybridABCandidateVerifyMs value 240
isready
position startpos
go movetime 5000
The Core ML/ANE path is an experimental Hybrid sidecar. It does not replace the
BT4 transformer MCTS and it is not enabled by default. The retained profile uses
the smaller T1-512 Lc0 network on cpu-ne as root evidence while BT4 continues
to run on Metal/MPSGraph:
setoption name HybridANERootProbe value true
setoption name HybridANERootHints value false
setoption name HybridANEWeights value networks/t1-512x15x8h-distilled-swa-3395000.pb.gz
setoption name HybridANEModelPath value build/coreml/compiled/t1-512-heads-b8.mlmodelc
setoption name HybridANEComputeUnits value cpu-ne
setoption name HybridANEOnlyPawnEndgames value false
setoption name HybridANEConfirmMCTSOverride value true
setoption name HybridANERootHintCount value 10
setoption name HybridANERootHintWaitMs value 0
setoption name HybridANEMinBudgetMs value 500
Current ANE findings:
cpu-neis the only retained Core ML compute-unit profile.allandcpu-gpuare faster in isolation but compete with Metal inference.- T1-512 is retained over T1-256. T1-256 is lower latency, but it regressed the ANE-sensitive repeat gate.
- The Lichess bot probes all roots when ANE is explicitly enabled. The current retained profile keeps root hints off, allows ANE to confirm narrowly gated MCTS overrides, and starts ANE probes at 500 ms or larger move budgets.
- ANE root hints are available as an explicit experiment, but they are disabled
by default. After the low-visit pawn-lever update, ANE root hints regressed
the local 1s BK repeat from 67/72 to 66/72, while ANE probe without root
hints matched the retained non-ANE profile. The retained wait profile remains
250 msfor explicit root-hint runs. - ANE is retained as a confirming sidecar, not as a third equal engine. Puzzle
summaries report
ANE searches, non-empty roots, top moves, failures, agreement with MCTS, and ANE-confirmed MCTS overrides so runs distinguish “configured” from “actually contributed.”
Useful local probes:
python3 tools/lichess_puzzle_runner.py \
--offline-csv results/judge_ane_broad_1000ms/hard200_sample.csv \
--engine build/metalfish --mode hybrid \
--weights networks/BT4-1024x15x32h-swa-6147500.pb \
--hybrid-ane-root-probe \
--hybrid-ane-confirm-mcts-override \
--hybrid-ane-weights networks/t1-512x15x8h-distilled-swa-3395000.pb.gz \
--hybrid-ane-model-path build/coreml/compiled/t1-512-heads-b8.mlmodelc \
--hybrid-ane-compute-units cpu-ne \
--hybrid-ane-min-budget-ms 500 \
--threads 8 --hash-mb 4096 --movetime-ms 3000
python3 tools/lc0_coreml_concurrency_benchmark.py \
networks/t1-512x15x8h-distilled-swa-3395000.pb.gz \
--metal-probe build/metalfish_nn_probe \
--metal-weights networks/BT4-1024x15x32h-swa-6147500.pb \
--coreml-compute-unit cpu-ne --batch-size 8
python3 tools/compare_puzzle_runs.py \
--baseline results/baseline.jsonl \
--candidate results/candidate.jsonl \
--match-repeat-ids --ane-summaryThe bot is configured around the Hybrid engine, local opening books, Syzygy, pondering, crash recovery, and resource-aware thread/hash sizing.
Download the included local Polyglot book set:
python3 tools/opening_book_manager.py download --book allRun a rated seek with an Elo floor:
python3 tools/lichess_bot.py --seek --rotate --no-casual \
--accept-rated --elo-seek --min-rated-opponent-elo 2200The bot will load books/*.bin automatically. Online Explorer fallback is off
by default; enable it only when needed with METALFISH_BOOK_ALLOW_ONLINE=true.
macOS:
cmake --build build --target metalfish_tests -j"$(sysctl -n hw.ncpu)"
./build/metalfish_tests
python3 tests/testing.py --quick
python3 tests/test_lichess_bot.py
python3 tests/test_lichess_puzzle_runner.py
python3 tests/test_ponder_stress.py --smoke
python3 tests/paper_benchmarks.py --tactical --movetime 5000 \
--threads auto --hash autoPortable Linux/Windows CPU and CUDA builds are covered by the GitHub workflows and GCP runtime gates. CUDA strength checks should use the same-commit Metal, Linux CUDA, and Windows CUDA artifacts so backend parity is measured against the same weights, search options, and source SHA.
Use precise language when discussing results:
- Valid: "MetalFish Hybrid solved X/24 Bratko-Kopec positions at 5s/position on an M2 Max using the fair benchmark script."
- Valid: "MetalFish MCTS loads the same BT4 Lc0 weights and uses compatible input/policy mapping, but has its own Metal/CUDA runtimes and search controls."
- Valid: "The hybrid engine combines CPU alpha-beta and GPU MCTS with live root-signal sharing and evidence-based final arbitration."
- Invalid without larger testing: "MetalFish is stronger than Stockfish" or "MetalFish has higher Elo than Lc0."
src/core/ Board representation and move generation
src/eval/ NNUE evaluation and Apple Silicon CPU kernels
src/nn/ BT4 transformer loader, encoder, Metal/MPSGraph, CUDA backends
src/search/ Alpha-beta search
src/mcts/ Lc0-style MCTS search
src/hybrid/ Parallel Hybrid coordinator
src/uci/ UCI protocol
tests/ Unit, UCI, ponder, and benchmark tests
tools/ Lichess bot, puzzle runner, books, Syzygy, tournaments
MetalFish builds on outstanding open-source work, all under GPL-3.0:
- Stockfish (https://github.com/official-stockfish/Stockfish) — the
alpha-beta search, transposition table, and NNUE evaluation are derived from
Stockfish, and the default
nn-*.nnuenetworks are Stockfish networks. - Leela Chess Zero / Lc0 (https://github.com/LeelaChessZero/lc0) — the MCTS
and the BT4 transformer input/policy encoding are Lc0-compatible, and the
BT4-1024x15x32hnetwork is an Lc0 network.
These projects are GPL-3.0, the same license MetalFish uses; see their repositories for their respective copyright holders.
MetalFish is released under GPL-3.0. See LICENSE.