serenedb/codesearch

SereneCodeSearch — code search over programming tasks

A small web app for searching programming tasks and their reference solutions, backed entirely by SereneDB inverted indexes:

  • Task search — BM25 full-text over problem statements, titles and tags.
  • Code search — substring (grep-grade, exact) and fuzzy ("shape of the code") search over solution source, powered by the sparse_ngram tokenizer (the GitHub code-search ngram scheme).
  • Best solutions per task — every task page lists its solutions, shortest first (a reasonable proxy for "best" since the dataset carries no runtime), with the shortest crowned 👑.
  • Hybrid search — reciprocal-rank fusion of a text retriever (concept) and a code retriever (idiom) in one query; tasks matched by both rank first.
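The hybrid ranking in the last bullet can be sketched in a few lines of Python. This is a minimal illustration of reciprocal-rank fusion, not the app's query code. The function name rrf_fuse, the sample hit lists and the constant k = 60 (a common default for RRF) are assumptions, since this README does not state which constant the app uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists with reciprocal-rank fusion.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    k = 60 is an assumed default, not a value taken from the app.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, task_id in enumerate(ranking, start=1):
            scores[task_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Illustrative hit lists (made-up results, best hit first).
text_hits = ["mbpp/601", "humaneval/0", "843/D"]   # text retriever (concept)
code_hits = ["843/D", "mbpp/601", "mbpp/17"]       # code retriever (idiom)

fused = rrf_fuse([text_hits, code_hits])
print(fused)
```

In this example the two tasks that appear in both lists (mbpp/601 and 843/D) land ahead of the tasks found by only one retriever, which is the "matched by both rank first" behaviour described above.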

No frameworks, no build step, no third-party packages — just Python's standard library and a hand-rolled Postgres-wire client (pgclient.py), so it runs on a bare Python 3 with nothing installed.

Quick start (Docker Compose)

The whole stack — database, index build, web UI, MCP endpoint and the optional semantic-search model — comes up with one command. Nothing has to be built or downloaded by hand and no dataset has to be staged locally: the db image fetches the official static serened release and every dataset is streamed straight from Hugging Face during the index build.

make run                      # build + start everything; site at http://localhost:8077

Or invoke Compose directly:

cd docker
cp .env.example .env          # tune ports / uid / build knobs (all optional)
docker compose up --build     # first run builds the entire corpus

Open http://localhost:8077 once it's up. The site comes online during the build (it shows a CREATE INDEX progress bar); search starts working as soon as the one-shot loader finishes.

What docker compose up does

It starts seven services (see docker/README.md for the service table and common operations):

  1. data-init — chowns the bind-mounted ../data dir (the derived embedding cache) to APP_UID:APP_GID, then exits.
  2. serened — the database. Downloads the official static release (SERENED_VERSION, default 26.06.2), persists its datadir in the sdb_data named volume, listens on 7899.
  3. ollama + ollama-init — serve and pull nomic-embed-text once (powers the semantic "Explore" search; skip with WITH_VECTORS=0).
  4. loader — the one-shot builder. Runs docker/fill.sh: build the text-search dictionaries → tasks_idx (sql_tasks_mega.sql) → task embeddings + HNSW → solutions_idx (sql_solutions_mega.sql), then exits ("fills and dies"). Idempotent.
  5. web: serves the UI via server.py on 8077.
  6. mcp — hosted MCP endpoint on 8079 (claude mcp add codesearch --transport http http://<host>:8079/mcp).

URLs and ports

| Service | URL / port | Override in .env |
|---|---|---|
| Web UI | http://localhost:8077 | WEB_PORT |
| Database (pg-wire) | 127.0.0.1:7899 | DB_PORT |
| MCP endpoint | http://localhost:8079/mcp | MCP_PORT |

Env knobs (docker/.env, copied from .env.example)

| Variable | Default | Effect |
|---|---|---|
| SERENED_VERSION | 26.06.2 | release tag to download into the db image |
| WEB_PORT / DB_PORT / MCP_PORT | 8077 / 7899 / 8079 | published host ports |
| APP_UID / APP_GID | 1000 | run the app containers as this uid:gid so the embedding cache under data/ stays owned by you (use id -u / id -g) |
| WITH_VECTORS | 1 | set 0 to skip the embedding/HNSW build (disables semantic "Explore") |
| FORCE_REBUILD | 0 | set 1 to rebuild the indexes even if solutions_idx already has rows |

Heads-up: the first run is a heavy build

The first docker compose up streams the full corpus from Hugging Face — notably the entire open-r1/codeforces-submissions accepted-solution split (≈11.4M solutions, filtered to source length 20–10000) plus the Codeforces / code_contests / MBPP / HumanEval / Rosetta Code / IOI task sets. Expect a long first run, and be aware that Hugging Face rate-limiting can slow or stall it (the SQL already retries with backoff). Plan for a multi-core machine, ~16 GB+ RAM and tens of GB of free disk.

Every subsequent up is instant: the datadir survives in the sdb_data volume, so the loader sees solutions_idx already populated and exits immediately. docker compose down keeps the volume; docker compose down -v wipes it for a clean rebuild.

Version note: 26.06.2 is the stock release. A columnstore fix for multi-row text fetch is in a pending SereneDB PR; bump SERENED_VERSION once it ships in a release.

Prerequisites

Just Docker + Docker Compose. No host toolchain, no manual serened build, no pre-staged dataset.

Dataset & licensing

The dataset is downloaded fresh by a script, never committed — and only from sources whose license clearly permits reuse. The default ("safe") set:

| Dataset | License | What |
|---|---|---|
| MBPP | CC-BY-4.0 | 974 authored Python tasks + reference solutions |
| HumanEval | MIT | 164 authored tasks + canonical solutions |

Both are authored for the dataset, so bundling them is fine with attribution (recorded in data/DATASET_LICENSE.txt at fetch time).

--source large additionally pulls DeepMind code_contests (CC-BY-4.0) — ~3.7k more problems + the shortest few accepted Python solutions each, taking the corpus to ~4.9k tasks / ~20k solutions.

The full Codeforces corpus is what the Docker Compose build assembles. Its sql_*_mega.sql index build reads these straight from Hugging Face parquet:

| Dataset | License | What |
|---|---|---|
| open-r1/codeforces | CC-BY-4.0 | ~9.5k problem statements (tasks) |
| open-r1/codeforces-submissions | CC-BY-4.0 | accepted submissions (solutions): the full default split, ≈11.4M rows, filtered to source length 20–10000 |

The Docker build reads the entire submissions split — there is no sampling cap. Codeforces' terms permit republishing statements with attribution + a source link (shown per problem) as long as it is not an auto-judge. The unlicensed MatrixStudio/Codeforces-Python-Submissions set is deliberately not used.

The standalone fetch_dataset.py --source codeforces path is a much smaller statements-only fetch (open-r1/codeforces under ODC-By-1.0); it bundles no Codeforces solution corpus. The full solution corpus only comes via the Docker mega parquet build above.

CC-BY-4.0 / MIT are attribution-only — no share-alike, commercial use allowed — so showing the credit (below) is the whole obligation. The data is downloaded at build time and never committed, so this repo ships code + notices, not the datasets.

Running locally without Docker (manual alternative)

The Quick start above is the recommended way to run the whole stack. If you'd rather drive a SereneDB server you already have — e.g. against a local source build instead of the downloaded release — you can build a smaller, license-safe corpus by hand.

Server. For the demo, use the perf build (RelWithDebInfo, no sanitizers) — it loads and queries far faster than the debug build, which matters for the larger datasets:

cd /path/to/serenedb
cmake --preset perf -DCMAKE_C_COMPILER=clang-21 -DCMAKE_CXX_COMPILER=clang++-21
ninja -C build_perf serened
./build_perf/bin/serened /tmp/sdb_perf_data --server_endpoints='pgsql+tcp://0.0.0.0:7899'

(The debug build at ./build/bin/serened also works but is much slower.)

Corpus + app.

cd codesearch-site

# 1. download a license-safe dataset and load it (re-runnable any time)
./setup.sh                       # MBPP + HumanEval (safe default, ~1.1k)
#   ./setup.sh --source large        # + DeepMind code_contests (CC-BY-4.0)
#   ./setup.sh --source codeforces   # also add open-r1/codeforces statements (ODC-By, statements only)
#   ./setup.sh --db-port 7899        # point at a different server

# 2. serve the app
python3 server.py                # http://127.0.0.1:8077/
#   options: --port 8077  --db-host 127.0.0.1  --db-port 7899
#   --auto-setup : if no dataset is loaded, download + index it on startup

This local path does not build the full ~11.4M-solution Codeforces corpus — that only comes from the Docker sql_*_mega.sql build. To assemble the full corpus against your own server, run those scripts directly (see docker/fill.sh).

Attribution for whatever dataset is loaded is shown in the app (footer + the "Data & licenses" page at #/about, served from /api/licenses), written to data/DATASET_LICENSE.txt at fetch time, and recorded in THIRD_PARTY_DATA.md with full notices under licenses/.

setup.sh runs fetch_dataset.py (writes data/*.jsonl + license file) then load.sql, which creates the cf_en / code_grams / code_grams_q dictionaries and the tasks_idx / solutions_idx indexes. Both indexes are index-covering: every column is either an index key or an INCLUDE column, so result rows materialise from the index without a base-table read (confirmed by EXPLAIN: a single IRESEARCH_SCAN, no table scan).

Open http://127.0.0.1:8077/.

Load testing

load_test.py simulates active users against the JSON API with a realistic traffic mix: task searches, code searches, unified searches, result-detail opens, and occasional stats/health calls. By default --rpm-per-user auto means 40 requests per minute per active user.

python3 load_test.py --users 100 --rpm-per-user auto --duration 2m
# target: 100 users * 40 rpm = 4000 rpm total (~66.7 rps)

Useful knobs:

python3 load_test.py --base-url http://127.0.0.1:8077 --users 25 --rpm-per-user 30
python3 load_test.py --users 200 --duration 5m --ramp-up 30s

The script prints live throughput, status counts, endpoint mix, and final p50/p95/p99 latency numbers.

Layout

| File | Role |
|---|---|
| fetch_dataset.py | Zero-dep downloader: pulls a license-safe dataset from the HF rows API, normalises it to data/tasks.jsonl + data/solutions.jsonl, writes data/DATASET_LICENSE.txt. |
| load.sql | Ingests data/*.jsonl and builds the index-covering inverted indexes. |
| setup.sh | Orchestrates fetch + load (re-runnable). |
| pgclient.py | Zero-dependency Postgres v3 wire client (simple Query protocol) + bounded connection pool + SQL-literal escaping. |
| db.py | Query layer: every search/detail SQL builder over the SereneDB indexes. |
| server.py | stdlib ThreadingHTTPServer: JSON API under /api/*, static files from static/, per-IP rate limiting. |
| static/ | Single-page app: index.html, app.js (hash router + views), style.css. |
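For orientation, the simple Query flow that pgclient.py is described as speaking frames each statement as one message: a type byte 'Q', a 4-byte big-endian length that counts itself but not the type byte, and the NUL-terminated SQL text. The sketch below shows only that framing as defined by the PostgreSQL v3 protocol. The helper names encode_query and split_message are hypothetical, and this is not the code in pgclient.py.

```python
import struct


def encode_query(sql: str) -> bytes:
    """Frame one SQL string as a v3 simple Query ('Q') message."""
    body = sql.encode("utf-8") + b"\x00"          # SQL text is NUL-terminated
    # The length field covers itself (4 bytes) plus the body, not the type byte.
    return b"Q" + struct.pack("!i", 4 + len(body)) + body


def split_message(buf: bytes):
    """Split the first complete message off a receive buffer.

    Returns (type_byte, payload, rest), or None if more bytes are needed.
    """
    if len(buf) < 5:                              # type byte + length not yet read
        return None
    (length,) = struct.unpack("!i", buf[1:5])
    end = 1 + length
    if len(buf) < end:                            # payload still incomplete
        return None
    return buf[:1], buf[5:end], buf[end:]


msg = encode_query("SELECT 1")
print(msg)
```

The same type-plus-length framing applies to the server's replies, so a client can run split_message in a loop over whatever the socket has delivered so far.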

API

| Endpoint | Params | Returns |
|---|---|---|
| GET /api/health | (none) | health status |
| GET /api/stats | (none) | corpus counts |
| GET /api/search/tasks | q, min_rating?, max_rating?, limit? | BM25-ranked tasks |
| GET /api/search/code | q, mode (exact or fuzzy), min_rating?, max_rating?, limit? | matching solutions |
| GET /api/search/hybrid | text, code, limit? | RRF-fused tasks |
| GET /api/task | id (e.g. mbpp/601, humaneval/0, or 843/D) | task + solutions (shortest first) |
| GET /api/solution | id (integer) | one solution with its task context |
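A small standard-library client for these endpoints might look like the sketch below. The endpoint paths and parameter names come from the table above. The base URL assumes the default local port, the helper names (build_url, get_json, demo) are hypothetical, and the shape of the returned JSON is not documented here, so the sketch only prints it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8077"   # assumed: the default local address


def build_url(path, **params):
    """Build an API URL, dropping optional params left as None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path, **params):
    """GET an endpoint and decode the JSON body."""
    with urlopen(build_url(path, **params), timeout=10) as resp:
        return json.load(resp)


def demo():
    """Needs a running server; not called automatically."""
    print(get_json("/api/health"))
    print(get_json("/api/search/tasks", q="binary search", limit=5))
    print(get_json("/api/search/code", q="range(n)", mode="exact", limit=5))
    print(get_json("/api/search/hybrid", text="shortest path", code="heapq", limit=5))
```

Call demo() once the app is serving on port 8077. Note that build_url percent-encodes code fragments such as range(n), which matters for code queries containing parentheses, spaces or operators.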

How search maps to SereneDB

  • Exact substring: tokenize the query with the covering sparse-ngram dictionary and require every gram (code @@ ts_all(ts_tokenize(ARRAY[q], 'code_grams_q'))), then a LIKE post-filter for exactness. The index narrows 30k rows to a handful; LIKE only verifies those.
  • Fuzzy: require a majority of the covering grams (ts_any(..., ceil(0.55 * grams))), BM25-ranked so the closest shape wins.
  • Hybrid: rank tasks by BM25 text relevance and by how many accepted solutions contain the code idiom, then fuse the two rank lists with RRF.
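The exact and fuzzy bullets above follow a narrow-then-verify pattern that can be illustrated in plain Python. This sketch makes two loud simplifications: it uses ordinary overlapping trigrams instead of the sparse_ngram tokenizer, and it ranks fuzzy hits by gram overlap count instead of BM25. All function names and the sample documents are hypothetical. Only the 0.55 majority ratio and the substring post-filter come from the text above.

```python
import math


def trigrams(text):
    """Plain overlapping 3-grams (a stand-in for sparse_ngram tokens)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Inverted index: gram -> set of doc ids containing it."""
    index = {}
    for doc_id, src in docs.items():
        for gram in trigrams(src):
            index.setdefault(gram, set()).add(doc_id)
    return index


def exact_search(index, docs, q):
    """Require every gram, then verify with a real substring check."""
    grams = trigrams(q)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(docs)        # query too short to produce grams: scan all
    return sorted(d for d in candidates if q in docs[d])


def fuzzy_search(index, docs, q, ratio=0.55):
    """Require ceil(ratio * grams) matching grams; rank by overlap count."""
    grams = trigrams(q)
    if not grams:
        return []
    need = math.ceil(ratio * len(grams))
    hits = {}
    for gram in grams:
        for doc_id in index.get(gram, ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    matched = [d for d, n in hits.items() if n >= need]
    return sorted(matched, key=lambda d: (-hits[d], d))


docs = {
    1: "for i in range(n): total += a[i]",
    2: "for j in range(m): print(b[j])",
    3: "while n > 0: n //= 2",
}
index = build_index(docs)
print(exact_search(index, docs, "range(n)"))
print(fuzzy_search(index, docs, "in range(k)"))
```

The query "in range(k)" has 9 trigrams, so the fuzzy threshold is ceil(0.55 * 9) = 5. Documents 1 and 2 each share 7 of those grams and pass, even though neither contains the literal text, while the exact search keeps only the document that really contains "range(n)".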
