Add Configuration Maximums page for Kubermatic Virtualization by mihiragrawal · Pull Request #2190 · kubermatic/docs

mihiragrawal · 2026-06-03T09:20:39Z

What

New documentation page: Configuration Maximums for Kubermatic Virtualization, under content/kubermatic-virtualization/main/configuration-maximums/. It documents the validated configuration maximums of a KubeV cluster — VMs, networks, firewall rules, subnets, routes, etc. — as discovered by the in-cluster ConfigMax benchmark tool.

Lands in main (next release); v1.1.0 is frozen.

What's in it

Validated maximums — customer-facing table. Embargo-clean (no VMware/vSphere naming). Uses an Accepted (objects stored) vs Sustained (objects in place while VM-to-VM latency stayed within 5 % of baseline) split so the headline numbers stay visible but honest.
Engineering reference — internal table with the full VMware vSphere ConfigMax comparison and measurement provenance. Marked as internal / may be trimmed before public release.
How we measure — discovery / target / workload-SLO run modes + the distress signals that stop a run.
What each number is limited by — one-line bottleneck per capability.
Run it yourself — prerequisites, one annotated ConfigMaxRun YAML, apply/watch/read commands, and a key-parameters reference.

For the reviewer (@Moath — decisions needed)

Engineering table: keep it on the public page, or trim to the customer table only?
"Validation in progress" rows (attachment templates, routable services, QoS policies, VMs-per-host): include now or omit until re-validated?
Any numbers to soften before this is customer-visible.

Verification

Built locally with the project's Dockerized Hugo (quay.io/kubermatic/hugo:0.159.1-0), exit 0, no errors. Page renders and appears in the main section nav.

🤖 Generated with Claude Code

Document the validated configuration maximums of a KubeV cluster as discovered by the ConfigMax benchmark tool. Includes:

- a customer-facing "Validated maximums" table (accepted vs sustained),
- an internal engineering reference with the vSphere ConfigMax comparison and full measurement provenance (marked for review/trim),
- a "How we measure" section covering discovery / target / workload-SLO modes and the distress signals that stop a run,
- a per-capability bottleneck summary,
- a "Run it yourself" guide with an annotated ConfigMaxRun YAML and the key parameters reference.

Lands in content/kubermatic-virtualization/main (next release).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kubermatic-bot · 2026-06-03T09:21:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign iammerus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… methodology

Replace the accepted/sustained model with the programmed + verified-functional ceiling dataset (June 2026 runs), split the page into a public part (marketing table, technical reference, run-it-yourself incl. CLI) and an internal engineering reference (methodology, distress probes, per-test method cards, tuning baseline, bottleneck/caveats registers, journey, glossary) behind a review divider. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Marketing and technical tables now lead with the actual resource (VPCs, Subnets, NetworkPolicies, SecurityGroups, Services); each description opens with the platform-neutral concept line for readers from other virtualization stacks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The What-it-means column now carries just the bolded platform-neutral concept per row; detail lives in the technical reference table. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Run dates and durations remain in the per-test method cards. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replace the qualitative 4x description with the per-run p99 numbers from the 2026-05-08 validation runs (cross-host 1.6 ms to 6.9 ms at the cliff, 3-8 ms sustained band, reproduced in the second run). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

State the full measured curve: 1.5-1.75 ms cross-host p99 through 70 tenants, 6.9 ms at 80, 3-8 ms band from 90-120, with the cliff points of all three validation runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Baseline, steady range, cliff, and degraded band each get a row with tenant count and measured cross-host p99, replacing the single dense sentence. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Values from the v8a per-batch trace; baseline same-host honestly marked not-recorded (only cross-host was captured in that run). Notes that the same-host cliff arrives one batch after the cross-host one. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Drop the tracked-connections row (a Linux kernel parameter, not a product capability — data stays in the internal method cards); label the latency row as the idle-cluster floor; move the ~80 active-tenants degradation result out of the capacity table with a footnote pointer to the degradation section. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The re-run with an independent apiserver-VIP bystander probe reached the full 120,000 cap cleanly: 120,001 policies / 355,469 ACLs settled, zero probe failures in ~4 h. The earlier 25,101 stop is confirmed as a one-off harness-pod network loss, not a data-plane wall. Updates the marketing and technical rows, the method card, bottleneck and caveats registers, and adds the journey entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The page was marked chapter = true, which the theme styles with enlarged chapter-intro typography — the whole page rendered with a larger font than sibling docs. Drop the chapter flag and the manual H1 (the theme renders the title) so it matches the other content pages and gains the standard in-page TOC. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Restructure the single long page into a section with sub-pages, matching the Architecture section's layout:

configuration-maximums/    overview, test environment, reading guide
validated-maximums/        headline table + technical reference
degradation/               tenant-scaling degradation result
running-configmax/         prerequisites
operator/                  ConfigMaxRun walkthrough, parameters, profiles
cli/                       standalone binary
engineering-reference/     internal-review warning + map
ceiling-methodology/       definition, run loop, distress probes
degradation-methodology/
method-cards/              per-test method cards
tuning-and-findings/       tuning baseline, bottleneck/caveats registers
glossary/

Content is moved verbatim; only cross-references changed from in-page anchors to ref shortcode links between the new pages. Refs use ./ and ../ relative paths because the theme's ref shortcode resolves bare names site-wide and errors on ambiguous section names (e.g. "cli").

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Per review feedback: the technical reference table now carries the data each run actually collected - run date, duration, and peak component readings against their danger lines (etcd database fill, control-plane database memory, rule-compiler CPU, host memory, programming pace, probe results) - so a technical reader can see what the cluster was doing the moment each ceiling was recorded. Adds the shared danger-line legend and a pointer to the per-test method cards for the full named-component data. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ference

Per review: the technical reference table's audience is technical readers, so the measured-data column now names the actual components (ovn-central, ovn-northd, ovs-ovn, kube-ovn-controller, ACLs, OVN LB VIPs) instead of platform-neutral paraphrases. Add the latency observations each run captured (gateway-ping RTT at 11.8k subnets, VPC provisioning latency) and a pod-to-pod latency row with the full idle-cluster measurement, including the 14k-orphan-subnet contrast that motivates the clean-cluster rule. VM-to-VM latency under tenant load stays on the Degradation page, referenced below the table. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…lidraw diagram

The two probes listed as missing (per-node ovs-ovn memory, kube-ovn-controller restart/crash) are now implemented and active in every run; the probe table gains both rows and the honest-list entry records how they were validated (forced-trip run + a full 10k-VPC ceiling re-run with zero false trips and an unchanged settled count). The ASCII run-loop block becomes the Excalidraw render per the diagram rule. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Readers need the current behavior, not the chronology of past mistakes and their fix dates. Probe history becomes present-tense design notes; method-card and tuning caveats keep the load-bearing facts (requirements, scaling rationale, safety warnings) and drop the earlier-run stories. Measurement provenance (run dates on published numbers) stays. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 D1 run supersedes the ~80±10 cliff: 120 tenants / 600 VMs (run cap) with flat VM-to-VM p99 (406-488us vs 430us baseline), all 24 boundaries measured. The old cliff traced to memory-starved per-node networking agents + non-enforcing policies — kept as a superseded section with the root cause. Methodology page now documents the dual stop rule (2 ms floor AND 4x own baseline), probe-lost abort, and baseline completeness guards; method card, validated-maximums notice, and journey table updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
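The dual stop rule named in this commit message can be read as "stop only when the measured p99 is above the 2 ms floor and also above 4x the run's own baseline". The sketch below illustrates that reading only; the function name, the constant names, and the strict greater-than comparison are assumptions for illustration and are not taken from the ConfigMax source.

```python
# Hypothetical illustration of the dual stop rule (2 ms floor AND 4x own baseline).
# Names and comparison operators are assumptions, not the ConfigMax implementation.

FLOOR_US = 2_000.0          # absolute floor: 2 ms, expressed in microseconds
BASELINE_MULTIPLIER = 4.0   # relative limit: 4x the run's own baseline p99


def should_stop(p99_us: float, baseline_p99_us: float) -> bool:
    """Return True only if BOTH conditions hold for the measured p99."""
    if baseline_p99_us <= 0:
        raise ValueError("baseline p99 must be positive")
    above_floor = p99_us > FLOOR_US
    above_relative = p99_us > BASELINE_MULTIPLIER * baseline_p99_us
    return above_floor and above_relative


if __name__ == "__main__":
    # Figures quoted in the commit message: 430 us baseline, 406-488 us measured.
    for measured in (406.0, 488.0):
        print(measured, should_stop(measured, baseline_p99_us=430.0))
```

Under this reading, a low baseline such as 430 us cannot trigger a stop on the relative condition alone (4x is 1,720 us, still under the floor), which matches the flat-latency outcome the message reports.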

Readers need the current result, not the story of how it was reached: dropped the superseded-cliff section, validation dates in prose, and roadmap chatter; sizing lessons kept as plain guidance. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Internal cluster name removed from the published pages. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

kubermatic-bot added the dco-signoff: yes label (all commits in the pull request have the valid DCO signoff message) and the size/L label (a PR that changes 100-499 lines, ignoring generated files) on Jun 3, 2026

kubermatic-bot added the size/XL label (a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label (a PR that changes 100-499 lines, ignoring generated files) on Jun 10, 2026

mihiragrawal and others added 14 commits June 10, 2026 15:48

Slim the validated-maximums table to concept-only descriptions

456a93d


Drop the Run column from the technical reference table

01e76f9


Anchor degradation latency figures to tenant counts

ba524a6


Present the degradation curve as a staged table

d47b284


kubermatic-bot added the size/XXL label (a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (a PR that changes 500-999 lines, ignoring generated files) on Jun 12, 2026

mihiragrawal and others added 4 commits June 12, 2026 14:33

Refer to the test cluster as "reference cluster" only

50cd871


mihiragrawal wants to merge 20 commits into kubermatic:main from mihiragrawal:configmax-configuration-maximums
