USHIFT-6808: C2CC Disruption Tests (Restarting Services, toggling ethernet) #6930

Open
pmtk wants to merge 10 commits into openshift:main from pmtk:c2cc/chaos-testing/restarts

pmtk commented Jun 24, 2026

Member

Summary by CodeRabbit

  • New Features

    • Added disruptive cluster-resilience tests that simulate VM, network, and service failures and verify recovery.
    • Added broader cross-cluster checks for connectivity, DNS, and cluster health after disruptions.
    • Added new bootc scenario runners for deterministic disruptive test execution.
  • Bug Fixes

    • Improved cluster-state validation to handle the full set of clusters dynamically.
    • Strengthened recovery handling by restoring network access and re-establishing cluster connections during teardown.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 24, 2026
openshift-ci-robot commented Jun 24, 2026

@pmtk: This pull request references USHIFT-6808 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai Bot commented Jun 24, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR adds disruptive C2CC test coverage, shared Robot helpers for cluster validation and recovery, shell runner updates for test filtering, and new bootc scenario scripts that execute the disruptive suite.

Changes

C2CC disruptive test flow

| Layer / File(s) | Summary |
|---|---|
| Shared C2CC helpers: test/resources/c2cc.resource | Adds cluster CIDR maps and keywords for IP-rule checks, RemoteCluster state and unhealthy detection, infrastructure verification, connectivity, DNS, VM NIC disruption, and remote-cluster reconnect handling. |
| Probe keyword update: test/suites/c2cc/probe.robot | Replaces a RemoteCluster state lookup keyword with an errors lookup keyword and removes the IP-to-name helper. |
| Common runner wiring: test/bin/c2cc_common.sh | Extends c2cc_run_tests with include-tag forwarding and additional host VM variables, and updates the VM creation helper’s ip_family declaration. |
| Scenario entrypoints: test/scenarios-bootc/c2cc/el98-src@c2cc-disruptive.sh, test/scenarios-bootc/c2cc/el102-src@c2cc-disruptive.sh | Adds bootc scenario scripts that disable test randomization, manage VM lifecycle, and invoke the disruptive suite after host configuration. |
| Disruptive suite behavior: test/suites/c2cc/disruptive.robot | Adds the disruptive suite setup and teardown, recovery timing, cluster health checks, failure-inducing test cases, recovery verification, and NIC restore/reconnect handling. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • openshift/microshift#6894: Changes test/bin/c2cc_common.sh in the same c2cc_run_tests argument-building area, including forwarding additional --variable/--include values.

Suggested reviewers

  • eslutsky
  • copejon
🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Ipv6 And Disconnected Network Test Compatibility | ⚠️ Warning | New C2CC tests hardcode IPv4 pod/service CIDRs (10.42.0.0/16 etc.) and lack IPv6 CIDR fallback, so IPv6-only CI is incompatible. | Use family-aware CIDRs (e.g. correctCIDRFamily) or skip these tests on IPv6-only jobs; keep public-internet assumptions out of e2e paths. |
✅ Passed checks (13 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title matches the main change: new C2CC disruption tests around service restarts and network/ethernet toggling. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PASS: The patch adds Robot suite cases only; no It/Describe/Context/When titles or dynamic values appear in the changed tests. |
| Test Structure And Quality | ✅ Passed | PASS: This PR adds Robot tests, not Ginkgo; they use Suite Setup/Teardown or test [Setup]/[Teardown], bounded waits, and cleanup for created resources. |
| Microshift Test Compatibility | ✅ Passed | No new Go/Ginkgo e2e tests were added; the touched files are Bash/Robot Framework only, and no g.It/Describe/Context/When use appears in those paths. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Go/Ginkgo e2e tests were added; this PR only changes Robot/Bash C2CC suites, so the SNO Ginkgo check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Only test/support files changed; no manifests, controllers, or scheduling constraints were introduced. |
| Ote Binary Stdout Contract | ✅ Passed | No changed process-level entrypoints add stdout writes; the PR only touches shell/Robot resources, and c2cc_common.sh echoes errors to stderr. |
| No-Weak-Crypto | ✅ Passed | Touched files contain no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, no custom crypto, and no secret/token comparisons; the only crypto use is openssl rand for PSK generation. |
| Container-Privileges | ✅ Passed | Edited files are shell/Robot tests only; no K8s/container manifests or privileged settings (privileged, host*, SYS_ADMIN, allowPrivilegeEscalation, runAsUser:0) were added. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No added logging of passwords/tokens/PII found; the new disruptive scripts only wire existing helpers and do not introduce sensitive log output. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
openshift-ci Bot commented Jun 24, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai coderabbitai Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 24, 2026
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/bin/c2cc_common.sh`:
- Around line 186-193: The `setup_hosts` logic is handling `get_host_ip`
failures with early returns, but `full_vm_name` for `host2_vm` and `host3_vm` is
not checked the same way. Update the assignments around `full_vm_name`,
`host2_vm`, and `host3_vm` so a failure aborts immediately instead of
propagating empty `HOST2_VM_NAME`/`HOST3_VM_NAME` values into the NIC chaos
keywords. Keep the error handling consistent with the existing `get_host_ip`
pattern by returning from the function on lookup failure before the `readonly`
exports.

In `@test/resources/c2cc.resource`:
- Around line 562-563: The deregistration cleanup currently always calls
SSHLibrary.Close Connection even when SSHLibrary.Switch Connection fails, which
can close the wrong active SSH session on repeated deregistration. Update the
teardown logic in the affected resource so Close Connection only runs when the
switch to ${alias} succeeds, using the existing SSHLibrary.Switch Connection and
SSHLibrary.Close Connection keywords together with a conditional or status check
to preserve the current session when the alias is missing.
- Around line 541-549: In Disable All NICs For VM, add a fail-fast guard after
Get Vnet Devices For MicroShift Host returns ${vnet_ifaces} so the test errors
immediately when no vnet devices are found instead of silently doing nothing.
Check for an empty list before the FOR loop, and raise a clear failure that
includes ${vm_name} and the lack of discovered NICs so the issue is obvious in
the test output.
- Around line 483-488: The precondition gate in Ensure All Clusters Healthy only
verifies RemoteCluster CR status and can let a degraded local MicroShift/OVN-K
environment pass. Update this keyword to run the actual cluster healthcheck
before fault injection, alongside or instead of Verify RemoteCluster State, so
the setup blocks unless the real cluster health is healthy for each alias
(cluster-a, cluster-b, cluster-c).
- Around line 170-184: The IP rule checks in Verify Service IP Rules and Verify
All IP Rules only validate the rule targets, so they can miss wrong ordering.
Update these Robot Framework keywords to also assert the expected priority value
for each rule returned by Command On Cluster, using the existing `${stdout}`
checks with Should Contain against the full rule strings so table 200/201
entries are verified at the correct priority.
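
A minimal shell sketch of the last point above, assuming `ip rule show` prints rules in the usual `PRIORITY:<tab>SELECTOR lookup TABLE` form. Only the table numbers 200/201 and the 10.42.0.0/16 CIDR come from the review; the priorities (`5000`, `5001`), the 10.43.0.0/16 CIDR, the sample text, and the helper name `rule_present` are hypothetical placeholders for illustration.

```shell
# Hypothetical sample of `ip rule show` output; priorities and the
# second CIDR are placeholders, not values taken from this PR.
sample_rules=$(printf '%b' \
    '0:\tfrom all lookup local\n' \
    '5000:\tfrom all to 10.42.0.0/16 lookup 200\n' \
    '5001:\tfrom all to 10.43.0.0/16 lookup 201\n' \
    '32766:\tfrom all lookup main\n')

# rule_present RULES PRIORITY SELECTOR TABLE
# Succeeds only when one line matches the full rule string, so a rule
# that points at the right table but sits at the wrong priority fails.
rule_present() {
    printf '%s\n' "$1" \
        | tr '\t' ' ' \
        | sed 's/ *$//' \
        | grep -qxF "$2: $3 lookup $4"
}

if rule_present "$sample_rules" 5000 "from all to 10.42.0.0/16" 200; then
    echo "pod CIDR rule found at the expected priority"
fi
```

Matching the whole line rather than only the target is what catches ordering mistakes: the same selector and table at a different priority no longer passes.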

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9ab20d10-7e3c-468d-a472-1cea8e83ce11

📥 Commits

Reviewing files that changed from the base of the PR and between bfeabef and 64e01f1.

📒 Files selected for processing (6)
  • test/bin/c2cc_common.sh
  • test/resources/c2cc.resource
  • test/scenarios-bootc/c2cc/el98-src@c2cc-chaos.sh
  • test/suites/c2cc/chaos.robot
  • test/suites/c2cc/cleanup.robot
  • test/suites/c2cc/probe.robot
💤 Files with no reviewable changes (2)
  • test/suites/c2cc/cleanup.robot
  • test/suites/c2cc/probe.robot

Comment thread test/bin/c2cc_common.sh
Comment thread test/resources/c2cc.resource
Comment thread test/resources/c2cc.resource Outdated
Comment thread test/resources/c2cc.resource
Comment thread test/resources/c2cc.resource Outdated
pmtk added a commit to pmtk/microshift that referenced this pull request Jun 24, 2026
- c2cc_common.sh: add || return 1 to full_vm_name calls for consistency
- c2cc.resource: fail fast when no vnet interfaces found for VM
- c2cc.resource: only close SSH connection after successful switch

Co-Authored-By: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pmtk added a commit to pmtk/microshift that referenced this pull request Jun 24, 2026
- c2cc_common.sh: add || return 1 to full_vm_name calls for consistency
- c2cc.resource: fail fast when no vnet interfaces found for VM
- c2cc.resource: only close SSH connection after successful switch
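
The `|| return 1` change named in the commit message above could look roughly like the sketch below. This is not the code from `c2cc_common.sh`: `full_vm_name` is replaced here by a stub with made-up behavior, and `setup_hosts_sketch` is a hypothetical stand-in for the real `setup_hosts`.

```shell
# Stub standing in for the real helper; it fails on an empty argument.
full_vm_name() {
    [ -n "$1" ] || return 1
    printf 'scenario-%s\n' "$1"
}

setup_hosts_sketch() {
    # Declare first: `local var=$(cmd)` would report the status of
    # `local` itself and hide a failing lookup.
    local host2_vm host3_vm
    host2_vm=$(full_vm_name "$1") || return 1
    host3_vm=$(full_vm_name "$2") || return 1
    # Only reached when both lookups succeeded.
    HOST2_VM_NAME=$host2_vm
    HOST3_VM_NAME=$host3_vm
}

setup_hosts_sketch host2 host3 && echo "names: $HOST2_VM_NAME $HOST3_VM_NAME"
setup_hosts_sketch "" host3 || echo "lookup failed, aborting before export"
```

With the early return, an empty VM name never reaches the variables that the NIC keywords consume later.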
@pmtk pmtk force-pushed the c2cc/chaos-testing/restarts branch from 30371f0 to 0d99900 Compare June 24, 2026 13:41
pmtk commented Jun 24, 2026

Member Author

/test ?

pmtk commented Jun 24, 2026

Member Author

/test e2e-aws-tests-bootc-c2cc e2e-aws-tests-bootc-c2cc-arm

pmtk added 5 commits June 24, 2026 18:57
MicroShift, NetworkManager, OVN-K restarts.
NIC disabling.
- c2cc_common.sh: add || return 1 to full_vm_name calls for consistency
- c2cc.resource: fail fast when no vnet interfaces found for VM
- c2cc.resource: only close SSH connection after successful switch
@pmtk pmtk force-pushed the c2cc/chaos-testing/restarts branch from 0d99900 to bcfed52 Compare June 24, 2026 16:58
pmtk commented Jun 24, 2026

Member Author

/test e2e-aws-tests-bootc-c2cc e2e-aws-tests-bootc-c2cc-arm

@pmtk pmtk marked this pull request as ready for review June 24, 2026 17:01
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci openshift-ci Bot requested review from jogeo and vanhalenar June 24, 2026 17:03

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/suites/c2cc/chaos.robot`:
- Line 55: The cluster command in the chaos test is too disruptive for Command
On Cluster because restarting NetworkManager can drop the SSH session before the
call returns. Update the test in chaos.robot to use Disruptive Command On
Cluster for the systemctl restart NetworkManager step, using the existing
cluster-c command block so the session loss is handled safely.
- Around line 64-76: The disabled-state flag is set too late in the chaos test
flow, so a failure inside Disable All NICs For VM can leave NICs down while
Restore NICs And Reconnect sees no disabled VM to clean up. In the c2cc chaos
scenario, set ${DISABLED_VM} before calling Disable All NICs For VM and keep the
existing reset after recovery; use the existing keywords Disable All NICs For
VM, Restore NICs And Reconnect, and ${DISABLED_VM} to locate the change.
- Around line 149-157: The setup flow is missing validation for VM name
environment variables, which lets NIC-outage tests fail later with unclear
errors. Update Check Required Env Variables to also require HOST2_VM_NAME and
HOST3_VM_NAME alongside the existing host/IP/port/kubeconfig checks, and ensure
the Setup sequence still calls that validation before Register Remote Cluster so
failures surface early with a clear message.
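
The early validation asked for in the last item can be sketched in shell as follows. The suite itself does this in Robot Framework through `Check Required Env Variables`; the function name `check_required_env` and the placeholder values below are assumptions made for illustration only.

```shell
# check_required_env NAME...
# Collects every exported variable that is unset or empty and fails
# once, with all missing names in a single message.
check_required_env() {
    missing=""
    for name in "$@"; do
        # printenv prints nothing for an unset or empty variable.
        if [ -z "$(printenv "$name")" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing required environment variables:$missing" >&2
        return 1
    fi
}

export HOST2_VM_NAME="vm-host2"   # placeholder value
unset HOST3_VM_NAME
check_required_env HOST2_VM_NAME HOST3_VM_NAME || echo "setup would stop here"
```

Reporting all missing names at once avoids a fix-one-rerun-find-the-next cycle when several variables are absent.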

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8b370a02-f8f1-4408-b570-8eb1581bfc0e

📥 Commits

Reviewing files that changed from the base of the PR and between 0d99900 and bcfed52.

📒 Files selected for processing (5)
  • test/bin/c2cc_common.sh
  • test/resources/c2cc.resource
  • test/scenarios-bootc/c2cc/el98-src@c2cc-chaos.sh
  • test/suites/c2cc/chaos.robot
  • test/suites/c2cc/probe.robot
💤 Files with no reviewable changes (1)
  • test/suites/c2cc/probe.robot
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/bin/c2cc_common.sh
  • test/resources/c2cc.resource

Comment thread test/suites/c2cc/disruptive.robot
Comment thread test/suites/c2cc/disruptive.robot
Comment thread test/suites/c2cc/disruptive.robot
Comment thread test/scenarios-bootc/c2cc/el98-src@c2cc-disruptive.sh
Comment thread test/suites/c2cc/chaos.robot Outdated
Comment thread test/suites/c2cc/disruptive.robot Outdated
Comment thread test/suites/c2cc/chaos.robot Outdated
@pmtk pmtk changed the title USHIFT-6808: C2CC Chaos Tests (Restarting Services, toggling ethernet) USHIFT-6808: C2CC Disruption Tests (Restarting Services, toggling ethernet) Jun 25, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/suites/c2cc/chaos.robot`:
- Around line 58-60: Persist the disabled NIC list before calling Disable All
NICs For VM in the NIC-outage test flow so teardown always has recovery data.
Update the chaos.robot test cases that use Restore NICs And Reconnect and the
@{DISABLED_IFACES} variable so the interface list is assigned to test scope
before the disruptive keyword runs, ensuring it remains available even if
Disable All NICs For VM fails partway through.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6125ba91-d4ee-44ad-ba23-e69b06d6e318

📥 Commits

Reviewing files that changed from the base of the PR and between bcfed52 and 696cfdb.

📒 Files selected for processing (3)
  • test/resources/c2cc.resource
  • test/scenarios-bootc/c2cc/el102-src@c2cc-chaos.sh
  • test/suites/c2cc/chaos.robot
💤 Files with no reviewable changes (1)
  • test/resources/c2cc.resource

Comment thread test/suites/c2cc/disruptive.robot

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/suites/c2cc/disruptive.robot (1)

58-60: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

NIC recovery state is still not failure-safe

If Disable All NICs For VM fails before returning (Line 58 / Line 96), ${DISABLED_VM} and @{DISABLED_IFACES} are never populated, so teardown (Lines 145-148) skips or cannot restore NICs. Please move recovery-state capture into a failure-safe path (ideally inside the keyword with try/finally-style handling) so teardown can always re-enable interfaces after partial disruption.

Based on learnings: in this C2CC NIC-outage flow, setting only ${DISABLED_VM} earlier is insufficient; teardown must reliably receive disabled interface state even on failure paths.

Also applies to: 96-99, 145-148
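
One way to make the recovery state failure-safe, sketched in shell rather than Robot Framework: record each interface before touching it, so teardown can read the record even after a partial failure. Everything here is hypothetical: `nic_down` and `nic_up` are stubs for the real link operations (with `nic_down` rigged to fail on `vnet1`), and the state file stands in for the `${DISABLED_VM}` / `@{DISABLED_IFACES}` variables.

```shell
STATE_FILE=$(mktemp)

# Stubs for the real link operations; nic_down fails on "vnet1" to
# simulate a disruption that breaks partway through.
nic_down() { [ "$1" != "vnet1" ]; }
nic_up()   { echo "restored $1"; }

disable_all_nics() {
    for iface in "$@"; do
        # Record first, act second: the interface is known to teardown
        # even if the next command fails.
        echo "$iface" >> "$STATE_FILE"
        nic_down "$iface" || return 1
    done
}

restore_nics() {
    while read -r iface; do
        nic_up "$iface"
    done < "$STATE_FILE"
    # Clear the record once everything has been re-enabled.
    : > "$STATE_FILE"
}

disable_all_nics vnet0 vnet1 vnet2 || echo "disable failed partway"
restore_nics
```

In this run the failure happens on `vnet1`, yet both `vnet0` and `vnet1` are restored, because each was written to the record before the disabling step ran. Re-enabling an interface that never went down is harmless, which is why recording early is the safer order.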

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/suites/c2cc/disruptive.robot` around lines 58 - 60, The NIC recovery
state setup is not failure-safe because `${DISABLED_VM}` and
`@{DISABLED_IFACES}` are only populated after `Disable All NICs For VM` returns,
so failures leave teardown without the information it needs. Move the
disabled-VM/interface capture into a failure-safe path inside `Disable All NICs
For VM` (or wrap it with try/finally-style handling in the disruptive flow) so
the disabled interface state is always preserved and available to the
teardown/recovery logic in `disruptive.robot`.

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@test/suites/c2cc/disruptive.robot`:
- Around line 58-60: The NIC recovery state setup is not failure-safe because
`${DISABLED_VM}` and `@{DISABLED_IFACES}` are only populated after `Disable All
NICs For VM` returns, so failures leave teardown without the information it
needs. Move the disabled-VM/interface capture into a failure-safe path inside
`Disable All NICs For VM` (or wrap it with try/finally-style handling in the
disruptive flow) so the disabled interface state is always preserved and
available to the teardown/recovery logic in `disruptive.robot`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4677d390-5c12-4593-b462-d134f95c615c

📥 Commits

Reviewing files that changed from the base of the PR and between 696cfdb and 5c03ba7.

📒 Files selected for processing (3)
  • test/scenarios-bootc/c2cc/el102-src@c2cc-disruptive.sh
  • test/scenarios-bootc/c2cc/el98-src@c2cc-disruptive.sh
  • test/suites/c2cc/disruptive.robot

Comment thread test/suites/c2cc/disruptive.robot Outdated (13 threads)
agullon commented Jun 25, 2026

Contributor

/lgtm

agullon commented Jun 25, 2026

Contributor

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 25, 2026
@openshift-ci-robot


@agullon: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 25, 2026
openshift-ci Bot commented Jun 25, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: agullon, pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
