SereneCodeSearch — code search over programming tasks

A small web app for searching programming tasks and their reference solutions, backed entirely by SereneDB inverted indexes:

Task search — BM25 full-text over problem statements, titles and tags.
Code search — substring (grep-grade, exact) and fuzzy ("shape of the code") search over solution source, powered by the sparse_ngram tokenizer (the GitHub code-search ngram scheme).
Best solutions per task — every task page lists its solutions, shortest first (a reasonable proxy for "best" since the dataset carries no runtime), with the shortest crowned 👑.
Hybrid search — reciprocal-rank fusion of a text retriever (concept) and a code retriever (idiom) in one query; tasks matched by both rank first.
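For readers new to reciprocal-rank fusion, here is a minimal sketch of the general technique. It is not the SQL the app runs: the function name `rrf_fuse`, the constant `k=60` (a commonly used default) and the sample ids are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists into one.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical retriever outputs (task ids, best first).
text_hits = ["mbpp/601", "humaneval/0", "843/D"]
code_hits = ["843/D", "mbpp/601", "mbpp/12"]

fused = rrf_fuse([text_hits, code_hits])
print(fused)
```

Tasks that appear in both lists collect two reciprocal-rank terms, which is why they rise above tasks matched by only one retriever.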

No frameworks, no build step, no third-party packages — just Python's standard library and a hand-rolled Postgres-wire client (pgclient.py), so it runs on a bare Python 3 with nothing installed.

Quick start (Docker Compose)

The whole stack — database, index build, web UI, MCP endpoint and the optional semantic-search model — comes up with one command. Nothing has to be built or downloaded by hand and no dataset has to be staged locally: the db image fetches the official static serened release and every dataset is streamed straight from Hugging Face during the index build.

make run                      # build + start everything; site at http://localhost:8077

Or invoke Compose directly:

cd docker
cp .env.example .env          # tune ports / uid / build knobs (all optional)
docker compose up --build     # first run builds the entire corpus

Open http://localhost:8077 once it's up. The site comes online during the build (it shows a CREATE INDEX progress bar); search starts working as soon as the one-shot loader finishes.

What `docker compose up` does

It starts seven services (see docker/README.md for the service table and common operations):

data-init — chowns the bind-mounted ../data dir (the derived embedding cache) to APP_UID:APP_GID, then exits.
serened — the database. Downloads the official static release (SERENED_VERSION, default 26.06.2), persists its datadir in the sdb_data named volume, listens on 7899.
ollama + ollama-init — serve and pull nomic-embed-text once (powers the semantic "Explore" search; skip with WITH_VECTORS=0).
loader — the one-shot builder. Runs docker/fill.sh: build the text-search dictionaries → tasks_idx (sql_tasks_mega.sql) → task embeddings + HNSW → solutions_idx (sql_solutions_mega.sql), then exits ("fills and dies"). Idempotent.
web — server.py on 8077.
mcp — hosted MCP endpoint on 8079 (claude mcp add codesearch --transport http http://<host>:8079/mcp).

URLs and ports

Service	URL / port	Override in `.env`
Web UI	http://localhost:8077	`WEB_PORT`
Database (pg-wire)	`127.0.0.1:7899`	`DB_PORT`
MCP endpoint	http://localhost:8079/mcp	`MCP_PORT`

Env knobs (`docker/.env`, copied from `.env.example`)

Variable	Default	Effect
`SERENED_VERSION`	`26.06.2`	release tag to download into the db image
`WEB_PORT` / `DB_PORT` / `MCP_PORT`	`8077` / `7899` / `8079`	published host ports
`APP_UID` / `APP_GID`	`1000`	run the app containers as this uid:gid so the embedding cache under `data/` stays owned by you (use `id -u` / `id -g`)
`WITH_VECTORS`	`1`	set `0` to skip the embedding/HNSW build (disables semantic "Explore")
`FORCE_REBUILD`	`0`	set `1` to rebuild the indexes even if `solutions_idx` already has rows

Heads-up: the first run is a heavy build

The first docker compose up streams the full corpus from Hugging Face — notably the entire open-r1/codeforces-submissions accepted-solution split (≈11.4M solutions, filtered to source length 20–10000) plus the Codeforces / code_contests / MBPP / HumanEval / Rosetta Code / IOI task sets. Expect a long first run, and be aware that Hugging Face rate-limiting can slow or stall it (the SQL already retries with backoff). Plan for a multi-core machine, ~16 GB+ RAM and tens of GB of free disk.

Every subsequent docker compose up skips the build: the datadir survives in the sdb_data volume, so the loader sees solutions_idx already populated and exits immediately. docker compose down keeps the volume; docker compose down -v wipes it for a clean rebuild.
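The loader's skip-or-build behaviour can be pictured with a small sketch. This is an illustration of the logic described above, not a transcription of docker/fill.sh: the function name `plan_build` and its parameters are hypothetical, and the step names are just labels taken from the service list.

```python
def plan_build(solutions_rows, with_vectors=True, force_rebuild=False):
    """Return the build steps the one-shot loader would run, in order.

    solutions_rows: current row count of solutions_idx (0 on a fresh volume).
    with_vectors:   mirrors WITH_VECTORS (1 by default, 0 skips embeddings).
    force_rebuild:  mirrors FORCE_REBUILD (0 by default).
    """
    # Idempotence: a populated index means there is nothing to do,
    # unless a rebuild was explicitly requested.
    if solutions_rows > 0 and not force_rebuild:
        return []

    steps = ["text-search dictionaries", "tasks_idx"]
    if with_vectors:
        steps.append("task embeddings + HNSW")
    steps.append("solutions_idx")
    return steps


print(plan_build(0))                        # fresh volume: full build
print(plan_build(11_400_000))               # populated: loader exits at once
print(plan_build(11_400_000, force_rebuild=True))
```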

Version note: 26.06.2 is the stock release. A columnstore fix for multi-row text fetch is in a pending SereneDB PR; bump SERENED_VERSION once it ships in a release.

Prerequisites

Just Docker + Docker Compose. No host toolchain, no manual serened build, no pre-staged dataset.

Dataset & licensing

The dataset is downloaded fresh by a script, never committed — and only from sources whose license clearly permits reuse. The default ("safe") set:

Dataset	License	What
MBPP	CC-BY-4.0	974 authored Python tasks + reference solutions
HumanEval	MIT	164 authored tasks + canonical solutions

Both are authored for the dataset, so bundling them is fine with attribution (recorded in data/DATASET_LICENSE.txt at fetch time).

--source large additionally pulls DeepMind code_contests (CC-BY-4.0) — ~3.7k more problems + the shortest few accepted Python solutions each, taking the corpus to ~4.9k tasks / ~20k solutions.

The full Codeforces corpus is what the Docker Compose build assembles. Its sql_*_mega.sql index build reads these straight from Hugging Face parquet:

Dataset	License	What
open-r1/codeforces	CC-BY-4.0	~9.5k problem statements (tasks)
open-r1/codeforces-submissions	CC-BY-4.0	accepted submissions (solutions) — the full `default` split, ≈11.4M rows, filtered to source length 20–10000

The Docker build reads the entire submissions split; there is no sampling cap. Codeforces' terms permit republishing statements with attribution and a source link (shown per problem), provided the republishing site is not itself an auto-judge. The unlicensed MatrixStudio/Codeforces-Python-Submissions set is deliberately not used.

The standalone fetch_dataset.py --source codeforces path is a much smaller statements-only fetch (open-r1/codeforces under ODC-By-1.0); it bundles no Codeforces solution corpus. The full solution corpus only comes via the Docker mega parquet build above.

CC-BY-4.0 / MIT are attribution-only — no share-alike, commercial use allowed — so showing the credit (below) is the whole obligation. The data is downloaded at build time and never committed, so this repo ships code + notices, not the datasets.

Running locally without Docker (manual alternative)

The Quick start above is the recommended way to run the whole stack. If you'd rather drive a SereneDB server you already have — e.g. against a local source build instead of the downloaded release — you can build a smaller, license-safe corpus by hand.

Server. For the demo, use the perf build (RelWithDebInfo, no sanitizers) — it loads and queries far faster than the debug build, which matters for the larger datasets:

cd /path/to/serenedb
cmake --preset perf -DCMAKE_C_COMPILER=clang-21 -DCMAKE_CXX_COMPILER=clang++-21
ninja -C build_perf serened
./build_perf/bin/serened /tmp/sdb_perf_data --server_endpoints='pgsql+tcp://0.0.0.0:7899'

(The debug build at ./build/bin/serened also works but is much slower.)

Corpus + app.

cd codesearch-site

# 1. download a license-safe dataset and load it (re-runnable any time)
./setup.sh                       # MBPP + HumanEval (safe default, ~1.1k)
#   ./setup.sh --source large        # + DeepMind code_contests (CC-BY-4.0)
#   ./setup.sh --source codeforces   # also add open-r1/codeforces statements (ODC-By, statements only)
#   ./setup.sh --db-port 7899        # point at a different server

# 2. serve the app
python3 server.py                # http://127.0.0.1:8077/
#   options: --port 8077  --db-host 127.0.0.1  --db-port 7899
#   --auto-setup : if no dataset is loaded, download + index it on startup

This local path does not build the full ~11.4M-solution Codeforces corpus — that only comes from the Docker sql_*_mega.sql build. To assemble the full corpus against your own server, run those scripts directly (see docker/fill.sh).

Attribution for whatever dataset is loaded is shown in the app (footer + the "Data & licenses" page at #/about, served from /api/licenses), written to data/DATASET_LICENSE.txt at fetch time, and recorded in THIRD_PARTY_DATA.md with full notices under licenses/.

setup.sh runs fetch_dataset.py (writes data/*.jsonl + license file) then load.sql, which creates the cf_en / code_grams / code_grams_q dictionaries and the tasks_idx / solutions_idx indexes. Both indexes are index-covering: every column is either an index key or an INCLUDE column, so result rows materialise from the index without a base-table read (confirmed by EXPLAIN: a single IRESEARCH_SCAN, no table scan).

Open http://127.0.0.1:8077/.

Load testing

load_test.py simulates active users against the JSON API with a realistic traffic mix: task searches, code searches, unified searches, result-detail opens, and occasional stats/health calls. The default, --rpm-per-user auto, resolves to 40 requests per minute per active user.

python3 load_test.py --users 100 --rpm-per-user auto --duration 2m
# target: 100 users * 40 rpm = 4000 rpm total (~66.7 rps)

Useful knobs:

python3 load_test.py --base-url http://127.0.0.1:8077 --users 25 --rpm-per-user 30
python3 load_test.py --users 200 --duration 5m --ramp-up 30s

The script prints live throughput, status counts, endpoint mix, and final p50/p95/p99 latency numbers.
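To make the numbers above concrete, the sketch below reproduces the target-rate arithmetic and shows one common way to compute latency percentiles (nearest-rank). The helper names are hypothetical, and load_test.py may compute its percentiles differently.

```python
import math


def target_rate(users, rpm_per_user=40):
    """Return (total requests per minute, requests per second)."""
    total_rpm = users * rpm_per_user
    return total_rpm, total_rpm / 60.0


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list; p is an integer 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


total_rpm, rps = target_rate(users=100)
print(total_rpm, round(rps, 1))

# Hypothetical latencies in milliseconds.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
```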

Layout

File	Role
`fetch_dataset.py`	Zero-dep downloader: pulls a license-safe dataset from the HF rows API, normalises it to `data/tasks.jsonl` + `data/solutions.jsonl`, writes `data/DATASET_LICENSE.txt`.
`load.sql`	Ingests `data/*.jsonl` and builds the index-covering inverted indexes.
`setup.sh`	Orchestrates fetch + load (re-runnable).
`pgclient.py`	Zero-dependency Postgres v3 wire client (simple Query protocol) + bounded connection pool + SQL-literal escaping.
`db.py`	Query layer: every search/detail SQL builder over the SereneDB indexes.
`server.py`	stdlib `ThreadingHTTPServer`: JSON API under `/api/*`, static files from `static/`, per-IP rate limiting.
`static/`	Single-page app — `index.html`, `app.js` (hash router + views), `style.css`.
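The table mentions per-IP rate limiting in server.py without saying which algorithm or limits it uses. As one possible shape, here is a stdlib-only sliding-window limiter; the class name, the window approach and the injectable clock are all assumptions for illustration, not a description of the actual code.

```python
import threading
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per client IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                     # injectable so it can be tested
        self.hits = defaultdict(deque)         # ip -> timestamps of recent requests
        self.lock = threading.Lock()           # handlers run on multiple threads

    def allow(self, ip):
        now = self.clock()
        with self.lock:
            recent = self.hits[ip]
            # Drop timestamps that have aged out of the window.
            while recent and now - recent[0] >= self.window:
                recent.popleft()
            if len(recent) >= self.max_requests:
                return False
            recent.append(now)
            return True
```

A request handler would call `allow(client_ip)` first and answer with HTTP 429 when it returns `False`.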

API

Endpoint	Params	Returns
`GET /api/health`	—	health status
`GET /api/stats`	—	corpus counts
`GET /api/search/tasks`	`q`, `min_rating?`, `max_rating?`, `limit?`	BM25-ranked tasks
`GET /api/search/code`	`q`, `mode=exact|fuzzy`, `min_rating?`, `max_rating?`, `limit?`	matching solutions
`GET /api/search/hybrid`	`text`, `code`, `limit?`	RRF-fused tasks
`GET /api/task`	`id` (e.g. `mbpp/601`, `humaneval/0`, or `843/D`)	task + solutions (shortest first)
`GET /api/solution`	`id` (integer)	one solution with its task context
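A minimal stdlib client for the endpoints above might look like the following. The endpoint paths and parameter names come from the table; the helper names, the base URL constant and the timeout are assumptions, and the JSON response shape is not documented here, so the sketch returns whatever the server sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8077"  # assumed: the default local web port


def build_url(path, **params):
    """Build an API URL, dropping optional parameters left as None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path, **params):
    """GET an endpoint and decode its JSON body (needs a running server)."""
    with urlopen(build_url(path, **params), timeout=10) as resp:
        return json.load(resp)


# URL construction works offline:
print(build_url("/api/search/tasks", q="two pointers", limit=5))
print(build_url("/api/search/code", q="range(n)", mode="exact"))
print(build_url("/api/task", id="843/D"))

# With the stack running, for example:
#   get_json("/api/search/hybrid", text="shortest path", code="heapq.heappush")
```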

How search maps to SereneDB

Exact substring: tokenize the query with the covering sparse-ngram dictionary and require every gram (code @@ ts_all(ts_tokenize(ARRAY[q], 'code_grams_q'))), then a LIKE post-filter for exactness. The index narrows 30k rows to a handful; LIKE only verifies those.
Fuzzy: require a majority of the covering grams (ts_any(..., ceil(0.55 * grams))), BM25-ranked so the closest shape wins.
Hybrid: rank tasks by BM25 text relevance and by how many accepted solutions contain the code idiom, then fuse the two rank lists with RRF.
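The exact and fuzzy modes can be mimicked in a few lines of plain Python. This is a simplified sketch: it substitutes fixed-width trigrams for the sparse_ngram tokenizer, a linear scan for the inverted index, and gram-overlap count for BM25. All function names are hypothetical; only the 0.55 threshold and the "require all grams, then verify" structure are taken from the text.

```python
import math


def grams(text, n=3):
    """Set of all length-n substrings (a stand-in for sparse n-grams)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def exact_search(query, docs):
    q = grams(query)
    # Index step: keep only documents containing every query gram.
    candidates = [d for d in docs if q <= grams(d)]
    # Verification step (the LIKE post-filter): grams alone can give false positives.
    return [d for d in candidates if query in d]


def fuzzy_search(query, docs, fraction=0.55):
    q = grams(query)
    need = max(1, math.ceil(fraction * len(q)))   # majority of the query grams
    scored = [(len(q & grams(d)), d) for d in docs]
    # Rank by overlap count here; the app ranks with BM25 instead.
    return [d for hits, d in sorted(scored, key=lambda s: -s[0]) if hits >= need]


docs = ["for i in range(n):", "for j in range(m):", "while True:"]
print(exact_search("range(n)", docs))
print(fuzzy_search("range(n)", docs))

# Why the post-filter matters: "bcabc" contains every trigram of "abcab"
# (abc, bca, cab) yet does not contain "abcab" itself.
print(exact_search("abcab", ["bcabc"]))
```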
