A small web app for searching programming tasks and their reference solutions, backed entirely by SereneDB inverted indexes:
- Task search: BM25 full-text over problem statements, titles and tags.
- Code search: substring (`grep`-grade, exact) and fuzzy ("shape of the code") search over solution source, powered by the `sparse_ngram` tokenizer (the GitHub code-search ngram scheme).
- Best solutions per task: every task page lists its solutions, shortest first (a reasonable proxy for "best" since the dataset carries no runtime), with the shortest crowned 👑.
- Hybrid search: reciprocal-rank fusion of a text retriever (concept) and a code retriever (idiom) in one query; tasks matched by both rank first.
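
As a rough illustration of why tasks matched by both retrievers rise to the top, here is a minimal reciprocal-rank fusion sketch. The function name `rrf_fuse`, the sample ids and the constant `k=60` (a commonly used default) are assumptions for illustration; the app's actual fusion runs in SQL and may use different parameters.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists with score(d) = sum of 1 / (k + rank) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the text retriever and the code retriever.
text_hits = ["843/D", "mbpp/601", "humaneval/0"]
code_hits = ["mbpp/601", "1000/A", "843/D"]

fused = rrf_fuse([text_hits, code_hits])
print(fused)
```

Ids that appear in both lists collect two reciprocal-rank terms, so they outrank ids found by only one retriever even when that single rank is high.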
No frameworks, no build step, no third-party packages — just Python's standard
library and a hand-rolled Postgres-wire client (pgclient.py), so it runs on a
bare Python 3 with nothing installed.
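
To give a sense of why a wire client fits in the standard library alone, the sketch below frames a simple-protocol `Query` message as defined by the PostgreSQL v3 frontend/backend protocol. The helper names `build_simple_query` and `quote_literal` are hypothetical and are not the API of `pgclient.py`; the quoting helper also assumes the server treats backslashes literally (`standard_conforming_strings = on`).

```python
import struct


def build_simple_query(sql: str) -> bytes:
    """Frame a v3 simple Query message: b'Q', int32 length, NUL-terminated SQL."""
    payload = sql.encode("utf-8") + b"\x00"
    # The length field counts itself (4 bytes) plus the payload, not the type byte.
    return b"Q" + struct.pack("!I", 4 + len(payload)) + payload


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling single quotes (assumption: no E'' escapes)."""
    if "\x00" in value:
        raise ValueError("NUL bytes cannot appear in a SQL literal")
    return "'" + value.replace("'", "''") + "'"


message = build_simple_query("SELECT " + quote_literal("it's"))
print(message)
```

Sending this over a TCP socket after the startup handshake, then reading `RowDescription` / `DataRow` / `CommandComplete` messages until `ReadyForQuery`, is all the simple protocol needs for a query.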
The whole stack — database, index build, web UI, MCP endpoint and the optional
semantic-search model — comes up with one command. Nothing has to be built or
downloaded by hand and no dataset has to be staged locally: the db image
fetches the official static serened release and every dataset is streamed
straight from Hugging Face during the index build.

    make run    # build + start everything; site at http://localhost:8077

Or invoke Compose directly:

    cd docker
    cp .env.example .env        # tune ports / uid / build knobs (all optional)
    docker compose up --build   # first run builds the entire corpus

Open http://localhost:8077 once it's up. The site comes online during the build (it shows a CREATE INDEX progress bar); search starts working as soon as the one-shot loader finishes.

It starts seven services (see docker/README.md for the
service table and common operations):
- `data-init`: chowns the bind-mounted `../data` dir (the derived embedding cache) to `APP_UID:APP_GID`, then exits.
- `serened`: the database. Downloads the official static release (`SERENED_VERSION`, default `26.06.2`), persists its datadir in the `sdb_data` named volume, listens on 7899.
- `ollama` + `ollama-init`: serve and pull `nomic-embed-text` once (powers the semantic "Explore" search; skip with `WITH_VECTORS=0`).
- `loader`: the one-shot builder. Runs `docker/fill.sh`: build the text-search dictionaries → `tasks_idx` (`sql_tasks_mega.sql`) → task embeddings + HNSW → `solutions_idx` (`sql_solutions_mega.sql`), then exits ("fills and dies"). Idempotent.
- `web`: `server.py` on 8077.
- `mcp`: hosted MCP endpoint on 8079 (`claude mcp add codesearch --transport http http://<host>:8079/mcp`).

| Service | URL / port | Override in .env |
|---|---|---|
| Web UI | http://localhost:8077 | WEB_PORT |
| Database (pg-wire) | 127.0.0.1:7899 | DB_PORT |
| MCP endpoint | http://localhost:8079/mcp | MCP_PORT |

| Variable | Default | Effect |
|---|---|---|
| `SERENED_VERSION` | `26.06.2` | release tag to download into the db image |
| `WEB_PORT` / `DB_PORT` / `MCP_PORT` | `8077` / `7899` / `8079` | published host ports |
| `APP_UID` / `APP_GID` | `1000` | run the app containers as this uid:gid so the embedding cache under `data/` stays owned by you (use `id -u` / `id -g`) |
| `WITH_VECTORS` | `1` | set `0` to skip the embedding/HNSW build (disables semantic "Explore") |
| `FORCE_REBUILD` | `0` | set `1` to rebuild the indexes even if `solutions_idx` already has rows |

The first docker compose up streams the full corpus from Hugging Face —
notably the entire open-r1/codeforces-submissions accepted-solution split
(≈11.4M solutions, filtered to source length 20–10000) plus the Codeforces /
code_contests / MBPP / HumanEval / Rosetta Code / IOI task sets. Expect a long
first run, and be aware that Hugging Face rate-limiting can slow or stall
it (the SQL already retries with backoff). Plan for a multi-core machine,
~16 GB+ RAM and tens of GB of free disk.
Every subsequent up is instant: the datadir survives in the sdb_data
volume, so the loader sees solutions_idx already populated and exits
immediately. docker compose down keeps the volume; docker compose down -v
wipes it for a clean rebuild.
Version note: 26.06.2 is the stock release. A columnstore fix for
multi-row text fetch is in a pending SereneDB PR; bump SERENED_VERSION once it
ships in a release.
Just Docker + Docker Compose. No host toolchain, no manual serened build,
no pre-staged dataset.
The dataset is downloaded fresh by a script, never committed — and only from sources whose license clearly permits reuse. The default ("safe") set:
| Dataset | License | What |
|---|---|---|
| MBPP | CC-BY-4.0 | 974 authored Python tasks + reference solutions |
| HumanEval | MIT | 164 authored tasks + canonical solutions |
Both are authored for the dataset, so bundling them is fine with attribution
(recorded in data/DATASET_LICENSE.txt at fetch time).
--source large additionally pulls
DeepMind code_contests
(CC-BY-4.0) — ~3.7k more problems + the shortest few accepted Python solutions
each, taking the corpus to ~4.9k tasks / ~20k solutions.
The full Codeforces corpus is what the Docker Compose build assembles. Its
sql_*_mega.sql index build reads these straight from Hugging Face parquet:
| Dataset | License | What |
|---|---|---|
| open-r1/codeforces | CC-BY-4.0 | ~9.5k problem statements (tasks) |
| open-r1/codeforces-submissions | CC-BY-4.0 | accepted submissions (solutions) — the full default split, ≈11.4M rows, filtered to source length 20–10000 |
The Docker build reads the entire submissions split — there is no sampling cap.
Codeforces' terms permit republishing statements with attribution + a source
link (shown per problem) as long as it is not an auto-judge. The unlicensed
MatrixStudio/Codeforces-Python-Submissions set is deliberately not used.
The standalone fetch_dataset.py --source codeforces path is a much smaller
statements-only fetch (open-r1/codeforces under ODC-By-1.0); it bundles no
Codeforces solution corpus. The full solution corpus only comes via the Docker
mega parquet build above.
CC-BY-4.0 / MIT are attribution-only — no share-alike, commercial use allowed — so showing the credit (below) is the whole obligation. The data is downloaded at build time and never committed, so this repo ships code + notices, not the datasets.
The Quick start above is the recommended way to run the whole stack. If you'd rather drive a SereneDB server you already have — e.g. against a local source build instead of the downloaded release — you can build a smaller, license-safe corpus by hand.
Server. For the demo, use the perf build (RelWithDebInfo, no
sanitizers) — it loads and queries far faster than the debug build, which
matters for the larger datasets:

    cd /path/to/serenedb
    cmake --preset perf -DCMAKE_C_COMPILER=clang-21 -DCMAKE_CXX_COMPILER=clang++-21
    ninja -C build_perf serened
    ./build_perf/bin/serened /tmp/sdb_perf_data --server_endpoints='pgsql+tcp://0.0.0.0:7899'

(The debug build at `./build/bin/serened` also works but is much slower.)

Corpus + app.

    cd codesearch-site

    # 1. download a license-safe dataset and load it (re-runnable any time)
    ./setup.sh                        # MBPP + HumanEval (safe default, ~1.1k)
    # ./setup.sh --source large       # + DeepMind code_contests (CC-BY-4.0)
    # ./setup.sh --source codeforces  # also add open-r1/codeforces statements (ODC-By, statements only)
    # ./setup.sh --db-port 7899       # point at a different server

    # 2. serve the app
    python3 server.py                 # http://127.0.0.1:8077/
    #   options: --port 8077 --db-host 127.0.0.1 --db-port 7899
    #            --auto-setup : if no dataset is loaded, download + index it on startup

This local path does not build the full ~11.4M-solution Codeforces corpus; that only comes from the Docker `sql_*_mega.sql` build. To assemble the full corpus against your own server, run those scripts directly (see `docker/fill.sh`).

Attribution for whatever dataset is loaded is shown in the app (footer +
the "Data & licenses" page at #/about, served from /api/licenses), written
to data/DATASET_LICENSE.txt at fetch time, and recorded in
THIRD_PARTY_DATA.md with full notices under
licenses/.
setup.sh runs fetch_dataset.py (writes data/*.jsonl + license file) then
load.sql, which creates the cf_en / code_grams / code_grams_q
dictionaries and the tasks_idx / solutions_idx indexes. Both indexes are
index-covering: every column is either an index key or an INCLUDE column,
so result rows materialise from the index without a base-table read (confirmed
by EXPLAIN: a single IRESEARCH_SCAN, no table scan).
Open http://127.0.0.1:8077/.
load_test.py simulates active users against the JSON API with a realistic
traffic mix: task searches, code searches, unified searches, result-detail
opens, and occasional stats/health calls. By default --rpm-per-user auto
means 40 requests per minute per active user.

    python3 load_test.py --users 100 --rpm-per-user auto --duration 2m
    # target: 100 users * 40 rpm = 4000 rpm total (~66.7 rps)

Useful knobs:

    python3 load_test.py --base-url http://127.0.0.1:8077 --users 25 --rpm-per-user 30
    python3 load_test.py --users 200 --duration 5m --ramp-up 30s

The script prints live throughput, status counts, endpoint mix, and final p50/p95/p99 latency numbers.

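
The target-rate arithmetic in the comment above can be reproduced with a few lines. This is only a sketch of the calculation, not code from `load_test.py`; the function name `target_rates` is hypothetical, and 40 is the per-user rate the text gives for `--rpm-per-user auto`.

```python
def target_rates(users, rpm_per_user=40):
    """Return (total requests/min, total requests/sec, seconds between one user's requests)."""
    total_rpm = users * rpm_per_user
    total_rps = total_rpm / 60
    per_user_interval_s = 60 / rpm_per_user
    return total_rpm, total_rps, per_user_interval_s


total_rpm, total_rps, interval_s = target_rates(100)
print(f"{total_rpm} rpm total, {total_rps:.1f} rps, one request per user every {interval_s} s")
```

With 100 users this gives 4000 requests per minute in total, about 66.7 per second, with each simulated user issuing a request every 1.5 seconds on average.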

| File | Role |
|---|---|
| `fetch_dataset.py` | Zero-dep downloader: pulls a license-safe dataset from the HF rows API, normalises it to `data/tasks.jsonl` + `data/solutions.jsonl`, writes `data/DATASET_LICENSE.txt`. |
| `load.sql` | Ingests `data/*.jsonl` and builds the index-covering inverted indexes. |
| `setup.sh` | Orchestrates fetch + load (re-runnable). |
| `pgclient.py` | Zero-dependency Postgres v3 wire client (simple Query protocol) + bounded connection pool + SQL-literal escaping. |
| `db.py` | Query layer: every search/detail SQL builder over the SereneDB indexes. |
| `server.py` | stdlib `ThreadingHTTPServer`: JSON API under `/api/*`, static files from `static/`, per-IP rate limiting. |
| `static/` | Single-page app: `index.html`, `app.js` (hash router + views), `style.css`. |


| Endpoint | Params | Returns |
|---|---|---|
| `GET /api/health` | none | health status |
| `GET /api/stats` | none | corpus counts |
| `GET /api/search/tasks` | `q`, `min_rating?`, `max_rating?`, `limit?` | BM25-ranked tasks |
| `GET /api/search/code` | `q`, `mode=exact\|fuzzy`, `min_rating?`, `max_rating?`, `limit?` | matching solutions |
| `GET /api/search/hybrid` | `text`, `code`, `limit?` | RRF-fused tasks |
| `GET /api/task` | `id` (e.g. `mbpp/601`, `humaneval/0`, or `843/D`) | task + solutions (shortest first) |
| `GET /api/solution` | `id` (integer) | one solution with its task context |

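
A minimal standard-library client for the endpoints above might look like the following sketch. The helper names `api_url` and `api_get` are hypothetical, the base URL assumes the default local port, and the shape of the returned JSON is not documented here, so the sketch only decodes it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8077"  # assumption: server.py running on its default port


def api_url(path, **params):
    """Build an API URL, dropping optional parameters that were left as None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def api_get(path, **params):
    """GET an endpoint and decode the JSON body (needs a running server)."""
    with urlopen(api_url(path, **params), timeout=10) as resp:
        return json.load(resp)


print(api_url("/api/search/tasks", q="binary search", limit=5))
print(api_url("/api/search/code", q="range(n)", mode="exact"))
print(api_url("/api/task", id="843/D"))
# With the server up: api_get("/api/search/hybrid", text="shortest path", code="heappush")
```

`urlencode` percent-encodes values, so task ids containing `/` and code queries containing parentheses are safe to pass as-is.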

- Exact substring: tokenize the query with the covering sparse-ngram dictionary and require every gram (`code @@ ts_all(ts_tokenize(ARRAY[q], 'code_grams_q'))`), then a `LIKE` post-filter for exactness. The index narrows 30k rows to a handful; `LIKE` only verifies those.
- Fuzzy: require a majority of the covering grams (`ts_any(..., ceil(0.55 * grams))`), BM25-ranked so the closest shape wins.
- Hybrid: rank tasks by BM25 text relevance and by how many accepted solutions contain the code idiom, then fuse the two rank lists with RRF.
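
The filter-then-verify idea behind the exact and fuzzy modes can be sketched in plain Python. Note the substitutions: fixed-width trigrams stand in for the `sparse_ngram` tokenizer, a linear scan stands in for the inverted index, and gram-overlap count stands in for BM25 ranking. All function names are hypothetical.

```python
import math


def trigrams(text):
    """Set of all 3-character substrings (a stand-in for sparse n-grams)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def exact_search(query, docs):
    grams = trigrams(query)
    # Filter step (ts_all analogue): keep only docs that contain every query gram.
    candidates = [d for d in docs if grams <= trigrams(d)]
    # Verify step (LIKE analogue): grams can all be present without the substring.
    return [d for d in candidates if query in d]


def fuzzy_search(query, docs, ratio=0.55):
    grams = trigrams(query)
    if not grams:
        return []  # queries shorter than 3 characters produce no grams
    need = math.ceil(ratio * len(grams))  # same threshold form as ceil(0.55 * grams)
    scored = [(len(grams & trigrams(d)), d) for d in docs]
    # Rank by overlap, highest first (the real query ranks with BM25 instead).
    return [d for hits, d in sorted(scored, key=lambda s: -s[0]) if hits >= need]


docs = [
    "for i in range(n): total += a[i]",
    "for j in range(m): total += b[j]",
    "print(sum(a))",
]
print(exact_search("range(n)", docs))
print(fuzzy_search("range(n)", docs))
```

Here the exact search returns only the first document, while the fuzzy search also returns the second because it shares 4 of the 6 query trigrams, which meets the `ceil(0.55 * 6) = 4` threshold. The verify step matters because a document such as `"aba bab"` contains every trigram of `"abab"` without containing the substring itself.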