Add wallclock TaskBlock profiler support#11544
Conversation
This comment has been minimized.
This comment has been minimized.
|
|
||
| JAVA_PROFILER_REF: | ||
| description: "When non-empty, clone DataDog/java-profiler at this Git ref (branch or tag), build ddprof, and use it as ddprof.jar for Gradle jobs instead of the Maven dependency." | ||
| value: "paul.fournillon/wallclock-taskblock" |
There was a problem hiding this comment.
[Sphinx Review — HIGH] JAVA_PROFILER_REF defaults to a personal feature branch ('paul.fournillon/wallclock-taskblock'). The build_java_profiler_ddprof job fires whenever the value is non-empty, so by default the entire pipeline builds and injects ddprof from this personal branch instead of the published artifact. This includes deploy_to_maven_central and deploy_snapshot_with_ddprof_snapshot. If the personal branch is deleted, force-pushed, or renamed, 'git clone --branch' fails and breaks all builds.
Suggestion: Set the default value to an empty string before merge (keep the override mechanism for ad-hoc testing). Apply the same change to env JAVA_PROFILER_REF in .github/workflows/run-system-tests.yaml.
[CONSENSUS · confidence: high · adversary-tested]
| * @param spanId span ID captured at sleep entry | ||
| * @param rootSpanId root span ID captured at sleep entry | ||
| */ | ||
| default void enqueueTaskBlock( |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] enqueueTaskBlock (interface default) and DatadogProfiler.recordTaskBlockFromContextEvent / TaskBlockBridge.recordTaskBlockFromContext have zero callers anywhere in the repo. The Javadoc claims enqueueTaskBlock is 'Called from the Thread.sleep instrumentation finish path', but that path does not exist in this PR. The entire new TaskBlock bridge surface (blockEnter/blockExit/parkEnter/parkExit/getCurrentTicks/getCurrentThreadId) is also reachable only through overrides in DatadogProfilingIntegration that nothing in production invokes.
Suggestion: Either land the instrumentation advice that calls these methods in the same change, or drop the unused interface/implementation surface until the consuming instrumentation is ready. At minimum, fix the Javadoc so it does not assert a calling path that does not exist.
[CONSENSUS · concern: necessity · confidence: high]
| mkdir -p "${CI_PROJECT_DIR}/custom-ddprof" | ||
| SRCDIR="${CI_PROJECT_DIR}/java-profiler-src" | ||
| rm -rf "$SRCDIR" | ||
| git clone --depth 1 --branch "$JAVA_PROFILER_REF" https://github.com/DataDog/java-profiler.git "$SRCDIR" |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] External java-profiler clone is pinned only to a mutable branch ref (--branch $JAVA_PROFILER_REF --depth 1) with no commit SHA. The produced ddprof.jar is injected into downstream jobs including deploy_to_maven_central and deploy_snapshot_with_ddprof_snapshot, with no reproducibility or supply-chain integrity guarantee. Any force-push to the branch silently changes the artifact without a CI diff.
Suggestion: Require a commit SHA (or signed tag) rather than a branch for the ref, or restrict the custom-build path to non-publish pipelines. At minimum, exclude build_java_profiler_ddprof from deploy_to_maven_central / deploy_snapshot needs so a custom build can never reach a release.
[concern: security · confidence: medium]
| return 0; | ||
| } | ||
|
|
||
| @Override |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] encode/encodeOperationName/encodeResourceName overrides delegate to DatadogProfiler.encode, which is hardcoded to return 0 ('java-profiler ContextSetter no longer exposes value encoding'). The overrides therefore always return 0, identical to the interface defaults, but add call overhead on every span creation.
Suggestion: Drop the three no-op overrides (let the interface defaults stand) until real encoding is restored in ddprof, or document clearly with a TODO that encoding is intentionally disabled.
[concern: necessity · confidence: medium]
| continue; | ||
| } | ||
| taskBlockCount++; | ||
| long spanId = spanIdAccessor.getMember(item).longValue(); |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] spanIdAccessor, localRootSpanIdAccessor, blockerAccessor and unblockingSpanIdAccessor are dereferenced with .getMember(item) without null checks, while operationAccessor and eventThreadAccessor in the same loop are null-guarded. JMC IAttribute.getAccessor() returns null when the attribute is absent from the event type (e.g. schema change or older ddprof artifact). The sibling SynchronizedContentionProfilingTest guards every accessor, showing the authors consider null a real possibility.
Suggestion: Guard each accessor with null checks before calling getMember, mirroring the SynchronizedContentionProfilingTest pattern, so a schema mismatch produces a clear failure rather than an opaque NPE.
[CONSENSUS · concern: robustness · confidence: high · also affects line 345 (emittedAccessor in addWallClockEpochs)]
| if (recording == null) { | ||
| log.warn("Datadog Profiler is not available. Skipping task-block integration test."); | ||
| return; | ||
| } |
There was a problem hiding this comment.
[Sphinx Review — MEDIUM] The test emits TaskBlock events (recordTaskBlockEvent and parkExit) and immediately stops the recording with no flush/drain barrier between emission and stop. The Javadoc describes an asynchronous drain thread for the park path, making the taskBlockCount > 0 assertion timing-dependent and potentially flaky. Separately, profiler.stop(recording) internally calls recording.stop(), then the finally block calls recording.stop() again unconditionally (double-stop).
Suggestion: Ensure events are flushed/drained before calling stop (or assert only on the synchronous recordTaskBlock path). Drop the unconditional recording.stop() in finally since stop(recording) already stops it. Consider @flaky if timing cannot be made deterministic.
[CONSENSUS · concern: correctness · confidence: medium]
| unblockingSpanId, | ||
| spanId, | ||
| rootSpanId); | ||
| } |
There was a problem hiding this comment.
[Sphinx Review — LOW] TaskBlockBridge dispatches every blockEnter/parkEnter/recordTaskBlock through Method.invoke with autoboxed varargs Object[] arguments. The interface Javadoc explicitly calls these hot-path methods, so reflective dispatch plus per-call boxing/array allocation will be a measurable overhead once callers exist.
Suggestion: When the loaded ddprof exposes these methods, bind them once via MethodHandle (invokeExact) or a small generated adapter to avoid per-call boxing on the hot path.
[concern: performance · confidence: low]
|
|
||
| public interface ProfilingContextIntegration extends Profiling, EndpointCheckpointer, Timer { | ||
| /** Native {@code OSThreadState::SLEEPING}; used for span-scoped Thread.sleep precheck state. */ | ||
| int BLOCKING_STATE_SLEEPING = 7; |
There was a problem hiding this comment.
[Sphinx Review — LOW] Interface constant BLOCKING_STATE_SLEEPING (=7) has no readers anywhere in the codebase. It is intended as the state argument for blockEnter(int), but no caller of blockEnter exists on this branch.
Suggestion: Defer adding this constant until the Thread.sleep/park instrumentation that consumes it lands, or document it explicitly as a placeholder API for the upcoming instrumentation.
[concern: necessity · confidence: high]
| } | ||
|
|
||
| /** Monotonic tick count for TaskBlock and wall-clock off-CPU interval timing. */ | ||
| public long getCurrentTicks() { |
There was a problem hiding this comment.
[Sphinx Review — LOW] Public getCurrentTicks() dereferences profiler directly with no null guard, while every sibling bridge accessor (getCurrentThreadId, getTscFrequency, hasTaskBlockEventSupport, blockEnter, parkEnter, etc.) guards with profiler != null. If profiler can be null (as the guards elsewhere assume), getCurrentTicks() NPEs.
Suggestion: Add the same profiler != null guard used by the other bridge methods, returning a sentinel (e.g. 0L) when the profiler is unavailable.
[concern: consistency · confidence: medium]
What Does This Do
Add support for wall-clock TaskBlock events, including span-context attribution and native event recording for blocking intervals such as
Object.waitand synchronized contention.Motivation
Make blocking time visible as structured
datadog.TaskBlockevents instead of relying only on sampled wall-clock frames. It allows to reduce signal emissions.Additional Notes
Foundation for follow-up LockSupport and Thread.sleep TaskBlock instrumentation.
Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issue/merge. You can also:/merge --commit-message "..."/merge -c/merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-levelJira ticket: [PROJ-IDENT]