Skip to content

Add wallclock TaskBlock profiler support#11544

Draft
kaahos wants to merge 6 commits into
masterfrom
paul.fournillon/wallclock-taskblock
Draft

Add wallclock TaskBlock profiler support#11544
kaahos wants to merge 6 commits into
masterfrom
paul.fournillon/wallclock-taskblock

Conversation

@kaahos

@kaahos kaahos commented Jun 3, 2026

Copy link
Copy Markdown

What Does This Do

Add support for wall-clock TaskBlock events, including span-context attribution and native event recording for blocking intervals such as Object.wait and synchronized contention.

Motivation

Make blocking time visible as structured datadog.TaskBlock events instead of relying only on sampled wall-clock frames. It allows to reduce signal emissions.

Additional Notes

Foundation for follow-up LockSupport and Thread.sleep TaskBlock instrumentation.

Contributor Checklist

  • Format the title according to the contribution guidelines
  • Assign the type: and (comp: or inst:) labels in addition to any other useful labels
  • Avoid using close, fix, or any linking keywords when referencing an issue
    Use solves instead, and assign the PR milestone to the issue
  • Update the CODEOWNERS file on source file addition, migration, or deletion
  • Update public documentation with any new configuration flags or behaviors
  • Add your completed PR to the merge queue by commenting /merge. You can also:
    • Customize the commit message associated with the merge with /merge --commit-message "..."
    • Remove your PR from the merge queue with /merge -c
    • Skip all merge queue checks with /merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level
    • Get more information in this doc

Jira ticket: [PROJ-IDENT]

@datadog-datadog-prod-us1-2

This comment has been minimized.

Comment thread .gitlab-ci.yml

JAVA_PROFILER_REF:
description: "When non-empty, clone DataDog/java-profiler at this Git ref (branch or tag), build ddprof, and use it as ddprof.jar for Gradle jobs instead of the Maven dependency."
value: "paul.fournillon/wallclock-taskblock"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — HIGH] JAVA_PROFILER_REF defaults to a personal feature branch ('paul.fournillon/wallclock-taskblock'). The build_java_profiler_ddprof job fires whenever the value is non-empty, so by default the entire pipeline builds and injects ddprof from this personal branch instead of the published artifact. This includes deploy_to_maven_central and deploy_snapshot_with_ddprof_snapshot. If the personal branch is deleted, force-pushed, or renamed, 'git clone --branch' fails and breaks all builds.

Suggestion: Set the default value to an empty string before merge (keep the override mechanism for ad-hoc testing). Apply the same change to env JAVA_PROFILER_REF in .github/workflows/run-system-tests.yaml.

[CONSENSUS · confidence: high · adversary-tested]

* @param spanId span ID captured at sleep entry
* @param rootSpanId root span ID captured at sleep entry
*/
default void enqueueTaskBlock(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — MEDIUM] enqueueTaskBlock (interface default) and DatadogProfiler.recordTaskBlockFromContextEvent / TaskBlockBridge.recordTaskBlockFromContext have zero callers anywhere in the repo. The Javadoc claims enqueueTaskBlock is 'Called from the Thread.sleep instrumentation finish path', but that path does not exist in this PR. The entire new TaskBlock bridge surface (blockEnter/blockExit/parkEnter/parkExit/getCurrentTicks/getCurrentThreadId) is also reachable only through overrides in DatadogProfilingIntegration that nothing in production invokes.

Suggestion: Either land the instrumentation advice that calls these methods in the same change, or drop the unused interface/implementation surface until the consuming instrumentation is ready. At minimum, fix the Javadoc so it does not assert a calling path that does not exist.

[CONSENSUS · concern: necessity · confidence: high]

Comment thread .gitlab-ci.yml
mkdir -p "${CI_PROJECT_DIR}/custom-ddprof"
SRCDIR="${CI_PROJECT_DIR}/java-profiler-src"
rm -rf "$SRCDIR"
git clone --depth 1 --branch "$JAVA_PROFILER_REF" https://github.com/DataDog/java-profiler.git "$SRCDIR"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — MEDIUM] External java-profiler clone is pinned only to a mutable branch ref (--branch $JAVA_PROFILER_REF --depth 1) with no commit SHA. The produced ddprof.jar is injected into downstream jobs including deploy_to_maven_central and deploy_snapshot_with_ddprof_snapshot, with no reproducibility or supply-chain integrity guarantee. Any force-push to the branch silently changes the artifact without a CI diff.

Suggestion: Require a commit SHA (or signed tag) rather than a branch for the ref, or restrict the custom-build path to non-publish pipelines. At minimum, exclude build_java_profiler_ddprof from deploy_to_maven_central / deploy_snapshot needs so a custom build can never reach a release.

[concern: security · confidence: medium]

return 0;
}

@Override

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — MEDIUM] encode/encodeOperationName/encodeResourceName overrides delegate to DatadogProfiler.encode, which is hardcoded to return 0 ('java-profiler ContextSetter no longer exposes value encoding'). The overrides therefore always return 0, identical to the interface defaults, but add call overhead on every span creation.

Suggestion: Drop the three no-op overrides (let the interface defaults stand) until real encoding is restored in ddprof, or document clearly with a TODO that encoding is intentionally disabled.

[concern: necessity · confidence: medium]

continue;
}
taskBlockCount++;
long spanId = spanIdAccessor.getMember(item).longValue();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — MEDIUM] spanIdAccessor, localRootSpanIdAccessor, blockerAccessor and unblockingSpanIdAccessor are dereferenced with .getMember(item) without null checks, while operationAccessor and eventThreadAccessor in the same loop are null-guarded. JMC IAttribute.getAccessor() returns null when the attribute is absent from the event type (e.g. schema change or older ddprof artifact). The sibling SynchronizedContentionProfilingTest guards every accessor, showing the authors consider null a real possibility.

Suggestion: Guard each accessor with null checks before calling getMember, mirroring the SynchronizedContentionProfilingTest pattern, so a schema mismatch produces a clear failure rather than an opaque NPE.

[CONSENSUS · concern: robustness · confidence: high · also affects line 345 (emittedAccessor in addWallClockEpochs)]

if (recording == null) {
log.warn("Datadog Profiler is not available. Skipping task-block integration test.");
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — MEDIUM] The test emits TaskBlock events (recordTaskBlockEvent and parkExit) and immediately stops the recording with no flush/drain barrier between emission and stop. The Javadoc describes an asynchronous drain thread for the park path, making the taskBlockCount > 0 assertion timing-dependent and potentially flaky. Separately, profiler.stop(recording) internally calls recording.stop(), then the finally block calls recording.stop() again unconditionally (double-stop).

Suggestion: Ensure events are flushed/drained before calling stop (or assert only on the synchronous recordTaskBlock path). Drop the unconditional recording.stop() in finally since stop(recording) already stops it. Consider @flaky if timing cannot be made deterministic.

[CONSENSUS · concern: correctness · confidence: medium]

unblockingSpanId,
spanId,
rootSpanId);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — LOW] TaskBlockBridge dispatches every blockEnter/parkEnter/recordTaskBlock through Method.invoke with autoboxed varargs Object[] arguments. The interface Javadoc explicitly calls these hot-path methods, so reflective dispatch plus per-call boxing/array allocation will be a measurable overhead once callers exist.

Suggestion: When the loaded ddprof exposes these methods, bind them once via MethodHandle (invokeExact) or a small generated adapter to avoid per-call boxing on the hot path.

[concern: performance · confidence: low]


public interface ProfilingContextIntegration extends Profiling, EndpointCheckpointer, Timer {
/** Native {@code OSThreadState::SLEEPING}; used for span-scoped Thread.sleep precheck state. */
int BLOCKING_STATE_SLEEPING = 7;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — LOW] Interface constant BLOCKING_STATE_SLEEPING (=7) has no readers anywhere in the codebase. It is intended as the state argument for blockEnter(int), but no caller of blockEnter exists on this branch.

Suggestion: Defer adding this constant until the Thread.sleep/park instrumentation that consumes it lands, or document it explicitly as a placeholder API for the upcoming instrumentation.

[concern: necessity · confidence: high]

}

/** Monotonic tick count for TaskBlock and wall-clock off-CPU interval timing. */
public long getCurrentTicks() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Sphinx Review — LOW] Public getCurrentTicks() dereferences profiler directly with no null guard, while every sibling bridge accessor (getCurrentThreadId, getTscFrequency, hasTaskBlockEventSupport, blockEnter, parkEnter, etc.) guards with profiler != null. If profiler can be null (as the guards elsewhere assume), getCurrentTicks() NPEs.

Suggestion: Add the same profiler != null guard used by the other bridge methods, returning a sentinel (e.g. 0L) when the profiler is unavailable.

[concern: consistency · confidence: medium]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants