Add Configuration Maximums page for Kubermatic Virtualization #2190

Open
mihiragrawal wants to merge 20 commits into kubermatic:main from mihiragrawal:configmax-configuration-maximums

Conversation

@mihiragrawal

What

New documentation page: Configuration Maximums for Kubermatic Virtualization, under content/kubermatic-virtualization/main/configuration-maximums/. It documents the validated configuration maximums of a KubeV cluster — VMs, networks, firewall rules, subnets, routes, etc. — as discovered by the in-cluster ConfigMax benchmark tool.

Lands in main (next release); v1.1.0 is frozen.

What's in it

  1. Validated maximums — customer-facing table. Embargo-clean (no VMware/vSphere naming). Uses an Accepted (objects stored) vs Sustained (objects in place while VM-to-VM latency stayed within 5 % of baseline) split so the headline numbers stay visible but honest.
  2. Engineering reference — internal table with the full VMware vSphere ConfigMax comparison and measurement provenance. Marked as internal / may be trimmed before public release.
  3. How we measure — discovery / target / workload-SLO run modes + the distress signals that stop a run.
  4. What each number is limited by — one-line bottleneck per capability.
  5. Run it yourself — prerequisites, one annotated ConfigMaxRun YAML, apply/watch/read commands, and a key-parameters reference.
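
The Accepted vs Sustained split described in item 1 can be illustrated with a minimal Python sketch. It assumes a run yields (object count, VM-to-VM latency) samples in increasing order. The function name, the sample values, and the rule that Sustained ends at the first sample outside the 5 % band are illustrative assumptions, not the ConfigMax tool's actual implementation.

```python
from typing import Sequence, Tuple


def accepted_and_sustained(
    samples: Sequence[Tuple[int, float]],
    baseline_us: float,
    tolerance: float = 0.05,
) -> Tuple[int, int]:
    """Return (accepted, sustained) object counts.

    samples: (object_count, vm_to_vm_latency_us) pairs, increasing by count.
    accepted: highest object count that was stored at all.
    sustained: highest object count reached before latency first left
               the baseline * (1 + tolerance) band (assumed rule).
    """
    limit = baseline_us * (1 + tolerance)
    accepted = 0
    sustained = 0
    still_within = True
    for count, latency_us in samples:
        accepted = max(accepted, count)
        if still_within and latency_us <= limit:
            sustained = count
        else:
            # Once latency leaves the band, later samples no longer
            # extend the sustained figure, even if latency recovers.
            still_within = False
    return accepted, sustained


# Hypothetical samples, not measured data.
demo = [(1000, 400.0), (2000, 410.0), (3000, 460.0), (4000, 415.0)]
print(accepted_and_sustained(demo, baseline_us=400.0))
```

With these made-up samples the cluster accepts 4000 objects but sustains only 2000, because the third sample is more than 5 % above the 400 µs baseline.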

For the reviewer (@Moath — decisions needed)

  • Engineering table: keep it on the public page, or trim to the customer table only?
  • "Validation in progress" rows (attachment templates, routable services, QoS policies, VMs-per-host): include now or omit until re-validated?
  • Any numbers to soften before this is customer-visible.

Verification

Built locally with the project's Dockerized Hugo (quay.io/kubermatic/hugo:0.159.1-0), exit 0, no errors. Page renders and appears in the main section nav.

🤖 Generated with Claude Code

Document the validated configuration maximums of a KubeV cluster as
discovered by the ConfigMax benchmark tool. Includes:

- a customer-facing "Validated maximums" table (accepted vs sustained),
- an internal engineering reference with the vSphere ConfigMax comparison
  and full measurement provenance (marked for review/trim),
- a "How we measure" section covering discovery / target / workload-SLO
  modes and the distress signals that stop a run,
- a per-capability bottleneck summary,
- a "Run it yourself" guide with an annotated ConfigMaxRun YAML and the
  key parameters reference.

Lands in content/kubermatic-virtualization/main (next release).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kubermatic-bot added the dco-signoff: yes and size/L labels on Jun 3, 2026
@kubermatic-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign iammerus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… methodology

Replace the accepted/sustained model with the programmed + verified-functional
ceiling dataset (June 2026 runs), split the page into a public part (marketing
table, technical reference, run-it-yourself incl. CLI) and an internal
engineering reference (methodology, distress probes, per-test method cards,
tuning baseline, bottleneck/caveats registers, journey, glossary) behind a
review divider.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kubermatic-bot added the size/XL label and removed the size/L label on Jun 10, 2026
mihiragrawal and others added 14 commits June 10, 2026 15:48
Marketing and technical tables now lead with the actual resource
(VPCs, Subnets, NetworkPolicies, SecurityGroups, Services); each
description opens with the platform-neutral concept line for readers
from other virtualization stacks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The What-it-means column now carries just the bolded platform-neutral
concept per row; detail lives in the technical reference table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Run dates and durations remain in the per-test method cards.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the qualitative 4x description with the per-run p99 numbers
from the 2026-05-08 validation runs (cross-host 1.6 ms to 6.9 ms at the
cliff, 3-8 ms sustained band, reproduced in the second run).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
State the full measured curve: 1.5-1.75 ms cross-host p99 through 70
tenants, 6.9 ms at 80, 3-8 ms band from 90-120, with the cliff points
of all three validation runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Baseline, steady range, cliff, and degraded band each get a row with
tenant count and measured cross-host p99, replacing the single dense
sentence.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Values from the v8a per-batch trace; baseline same-host honestly marked
not-recorded (only cross-host was captured in that run). Notes that the
same-host cliff arrives one batch after the cross-host one.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop the tracked-connections row (a Linux kernel parameter, not a
product capability — data stays in the internal method cards); label
the latency row as the idle-cluster floor; move the ~80 active-tenants
degradation result out of the capacity table with a footnote pointer to
the degradation section.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The re-run with an independent apiserver-VIP bystander probe reached
the full 120,000 cap cleanly: 120,001 policies / 355,469 ACLs settled,
zero probe failures in ~4 h. The earlier 25,101 stop is confirmed as a
one-off harness-pod network loss, not a data-plane wall. Updates the
marketing and technical rows, the method card, bottleneck and caveats
registers, and adds the journey entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The page was marked chapter = true, which the theme styles with
enlarged chapter-intro typography — the whole page rendered with a
larger font than sibling docs. Drop the chapter flag and the manual
H1 (the theme renders the title) so it matches the other content
pages and gains the standard in-page TOC.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Restructure the single long page into a section with sub-pages,
matching the Architecture section's layout:

  configuration-maximums/        overview, test environment, reading guide
    validated-maximums/          headline table + technical reference
    degradation/                 tenant-scaling degradation result
    running-configmax/           prerequisites
      operator/                  ConfigMaxRun walkthrough, parameters, profiles
      cli/                       standalone binary
    engineering-reference/       internal-review warning + map
      ceiling-methodology/       definition, run loop, distress probes
      degradation-methodology/
      method-cards/              per-test method cards
      tuning-and-findings/       tuning baseline, bottleneck/caveats registers
      glossary/

Content is moved verbatim; only cross-references changed from in-page
anchors to ref shortcode links between the new pages. Refs use ./ and
../ relative paths because the theme's ref shortcode resolves bare
names site-wide and errors on ambiguous section names (e.g. "cli").

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Per review feedback: the technical reference table now carries the data
each run actually collected - run date, duration, and peak component
readings against their danger lines (etcd database fill, control-plane
database memory, rule-compiler CPU, host memory, programming pace,
probe results) - so a technical reader can see what the cluster was
doing the moment each ceiling was recorded. Adds the shared danger-line
legend and a pointer to the per-test method cards for the full
named-component data.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ference

Per review: the technical reference table's audience is technical
readers, so the measured-data column now names the actual components
(ovn-central, ovn-northd, ovs-ovn, kube-ovn-controller, ACLs, OVN LB
VIPs) instead of platform-neutral paraphrases. Add the latency
observations each run captured (gateway-ping RTT at 11.8k subnets,
VPC provisioning latency) and a pod-to-pod latency row with the full
idle-cluster measurement, including the 14k-orphan-subnet contrast
that motivates the clean-cluster rule. VM-to-VM latency under tenant
load stays on the Degradation page, referenced below the table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lidraw diagram

The two probes listed as missing (per-node ovs-ovn memory, kube-ovn-controller
restart/crash) are now implemented and active in every run; the probe table
gains both rows and the honest-list entry records how they were validated
(forced-trip run + a full 10k-VPC ceiling re-run with zero false trips and an
unchanged settled count). The ASCII run-loop block becomes the Excalidraw
render per the diagram rule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kubermatic-bot added the size/XXL label and removed the size/XL label on Jun 12, 2026
mihiragrawal and others added 4 commits June 12, 2026 14:33
Readers need the current behavior, not the chronology of past mistakes and
their fix dates. Probe history becomes present-tense design notes; method-card
and tuning caveats keep the load-bearing facts (requirements, scaling
rationale, safety warnings) and drop the earlier-run stories. Measurement
provenance (run dates on published numbers) stays.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 D1 run supersedes the ~80±10 cliff: 120 tenants / 600 VMs
(run cap) with flat VM-to-VM p99 (406-488us vs 430us baseline), all
24 boundaries measured. The old cliff traced to memory-starved
per-node networking agents + non-enforcing policies — kept as a
superseded section with the root cause.

Methodology page now documents the dual stop rule (2 ms floor AND
4x own baseline), probe-lost abort, and baseline completeness
guards; method card, validated-maximums notice, and journey table
updated to match.
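
One plausible reading of the dual stop rule, sketched in Python: a run stops only when the measured p99 is above the 2 ms floor and also above 4x the run's own baseline. The function and constant names are hypothetical, and the use of strict comparisons is an assumption; the methodology page is the authority on the exact rule.

```python
FLOOR_US = 2000.0        # 2 ms absolute floor, in microseconds
BASELINE_MULTIPLE = 4.0  # 4x the run's own baseline p99


def should_stop(p99_us: float, baseline_p99_us: float) -> bool:
    """Return True only when BOTH conditions of the dual stop rule hold."""
    above_floor = p99_us > FLOOR_US
    above_multiple = p99_us > BASELINE_MULTIPLE * baseline_p99_us
    return above_floor and above_multiple


# D1 run figures from this commit message: worst p99 488 us, baseline 430 us.
print(should_stop(488.0, 430.0))    # False
# Figures from the earlier cross-host runs: 1.6 ms baseline, 6.9 ms at the cliff.
print(should_stop(6900.0, 1600.0))  # True
```

Requiring both conditions means a cluster with a very low baseline is not stopped for a latency that is still under 2 ms (4 x 430 us = 1720 us would otherwise trip first), while a cluster with a high baseline is not stopped merely for crossing 2 ms.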

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Readers need the current result, not the story of how it was
reached: dropped the superseded-cliff section, validation dates in
prose, and roadmap chatter; sizing lessons kept as plain guidance.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Internal cluster name removed from the published pages.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Labels

dco-signoff: yes, size/XXL

2 participants