feat(storage): add resource span attributes for ACO (App Centric Observability) #16119
bajajneha27 wants to merge 17 commits into
Code Review
This pull request introduces a private helper method EnrichSpan to populate OpenTelemetry span attributes (gcp.resource.destination.id and gcp.resource.destination.location) using bucket metadata upon successful bucket operations (such as creation, retrieval, updates, and locking). It also adds corresponding unit tests to verify these attributes. The review comments suggest making EnrichSpan static since it does not access member variables, and checking for an uninitialized project number (value 0) to avoid generating invalid resource IDs.
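The two suggestions in this review can be illustrated with a minimal sketch. It is not the PR's code: `FakeSpan` and `BucketInfo` are hypothetical stand-ins so the example compiles without OpenTelemetry, and the resource ID layout is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for opentelemetry::trace::Span. It only records
// string attributes so the sketch builds without OpenTelemetry.
struct FakeSpan {
  std::map<std::string, std::string> attributes;
  void SetAttribute(std::string const& key, std::string const& value) {
    attributes[key] = value;
  }
};

// Hypothetical subset of the bucket metadata used by the helper.
struct BucketInfo {
  std::string name;
  std::string location;
  std::int64_t project_number = 0;  // 0 means "not initialized"
};

// A free function (or a static member): it reads no instance state.
// Returns true only if the attributes were set.
inline bool EnrichSpan(FakeSpan& span, BucketInfo const& bucket) {
  // Skip enrichment rather than emit an ID built from project number 0.
  if (bucket.project_number == 0 || bucket.name.empty()) return false;
  // Assumed ID layout, for illustration only.
  span.SetAttribute("gcp.resource.destination.id",
                    "//storage.googleapis.com/projects/" +
                        std::to_string(bucket.project_number) + "/buckets/" +
                        bucket.name);
  span.SetAttribute("gcp.resource.destination.location", bucket.location);
  return true;
}
```

Returning a flag keeps the "no attributes on incomplete metadata" case easy to cover in a unit test.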
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##             main   #16119    +/-   ##
========================================
  Coverage   92.20%   92.20%
========================================
  Files        2264     2267      +3
  Lines      208864   209341    +477
========================================
+ Hits       192579   193033    +454
- Misses      16285    16308     +23
return internal::EndSpan(*span, impl_->GetObjectMetadata(request));
EnrichSpan(*span, request.bucket_name());
auto result = impl_->GetObjectMetadata(request);
MaybeInvalidate(result, request.bucket_name());
MaybeInvalidate is called on almost all operations, including object-level operations. If a user requests a non-existent object in a valid bucket, the operation returns 404, and the bucket is evicted from the cache.
We should either call MaybeInvalidate only on bucket-level operations, where a 404 guarantees the bucket is gone, or check status.message() to distinguish a bucket 404 from an object 404 (this approach is somewhat brittle).
We could also do what Python does in this case by evicting only if the bucket is truly gone: https://github.com/googleapis/google-cloud-python/blob/384724c2d4c955e15274e9824bcdb93c685b79f6/packages/google-cloud-storage/google/cloud/storage/_bucket_metadata_cache.py#L68.
I'll make this change once we decide on the background thread discussion.
I can invalidate the cache only on bucket operations, and not on object operations.
If we want to follow how it's done in Python, we'd need to make an extra API call to check whether the bucket exists.
Let's invalidate the cache only on bucket operations for now. We can revisit this in the future, if we want to be more precise.
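A rough sketch of the agreed behavior, assuming the caller knows whether the operation was bucket-level or object-level. The `Code` and `OpKind` enums and the `BucketNameCache` class are hypothetical stand-ins; the real code works with google::cloud status types and the PR's own cache.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical status codes, standing in for the library's status type.
enum class Code { kOk, kNotFound, kPermissionDenied };

// Hypothetical tag for the kind of operation that produced the status.
enum class OpKind { kBucket, kObject };

// Minimal stand-in for the bucket metadata cache: just a set of names.
class BucketNameCache {
 public:
  void Insert(std::string const& bucket) { names_.insert(bucket); }
  bool Contains(std::string const& bucket) const {
    return names_.count(bucket) != 0;
  }
  void Erase(std::string const& bucket) { names_.erase(bucket); }

 private:
  std::set<std::string> names_;
};

// Evict only when a bucket-level call reports NOT_FOUND. An object-level 404
// may only mean the object is missing, so the cached bucket entry is kept.
inline void MaybeInvalidate(BucketNameCache& cache, OpKind kind, Code code,
                            std::string const& bucket) {
  if (kind == OpKind::kBucket && code == Code::kNotFound) cache.Erase(bucket);
}
```

The trade-off is the one noted above: a bucket deleted out of band stays cached until a bucket-level call observes the 404.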
}

auto current_options = google::cloud::internal::SaveCurrentOptions();
auto f = std::async(std::launch::async, [this, bucket_name,
std::async spawns threads dynamically, which causes the destructor of TracingConnection to block waiting for all background tasks to complete. Is there an alternate way?
Instead of spawning a new thread dynamically for every cache miss via std::async, can TracingConnection manage a single, long-lived background worker thread?
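For reference, a single long-lived worker could look roughly like the sketch below. `BackgroundWorker` and `Post` are hypothetical names, not part of the PR; tasks run one at a time in FIFO order, so fetches for different buckets would be serialized.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-thread task runner. One thread is created up front and
// reused for every posted task.
class BackgroundWorker {
 public:
  BackgroundWorker() : thread_([this] { Run(); }) {}

  ~BackgroundWorker() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      shutdown_ = true;
    }
    cv_.notify_one();
    thread_.join();  // waits for the queued tasks to drain
  }

  BackgroundWorker(BackgroundWorker const&) = delete;
  BackgroundWorker& operator=(BackgroundWorker const&) = delete;

  // Enqueue a task; it runs later on the worker thread.
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return shutdown_ || !queue_.empty(); });
        if (queue_.empty()) return;  // shutdown requested and queue drained
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // run outside the lock so Post() never waits on a task
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool shutdown_ = false;
  std::thread thread_;  // declared last: the members above exist before Run()
};
```

Note that the destructor in this sketch still waits for pending tasks, so it bounds thread creation but does not by itself remove the blocking on destruction.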
We can, and that would probably be a better option. But the metadata fetches on cache misses would not run concurrently in that case, which may be acceptable because I think cache misses will be infrequent. WDYT?
As discussed offline, we'll keep the current approach of having background threads so that the bucket metadata fetch can happen concurrently for different buckets.
The only concern here was that the threads should have some sort of deadline or timeout. I think the retry_policy we have configured takes care of that: as soon as the retries are exhausted, the thread ends too.
Just ensure we don't retry in case of permission errors. I think this is already the case, but it is good to check.
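A sketch of the behavior being asked for: a retry loop that stops immediately on a permanent error such as a permission failure. `FetchCode`, `IsTransient`, and `RetryLoop` are hypothetical stand-ins for illustration, not the library's retry policy classes, and which codes count as transient is an assumption.

```cpp
#include <cassert>
#include <functional>

// Hypothetical status codes for the background metadata fetch.
enum class FetchCode { kOk, kUnavailable, kPermissionDenied, kNotFound };

// Assumption for this sketch: only kUnavailable is worth retrying.
inline bool IsTransient(FetchCode code) {
  return code == FetchCode::kUnavailable;
}

// Calls `attempt` until it succeeds, returns a permanent error, or the
// attempt budget is spent. Returns the last observed code.
inline FetchCode RetryLoop(std::function<FetchCode()> const& attempt,
                           int max_attempts) {
  assert(max_attempts > 0);
  FetchCode last = FetchCode::kUnavailable;
  for (int i = 0; i < max_attempts; ++i) {
    last = attempt();
    if (last == FetchCode::kOk || !IsTransient(last)) return last;
  }
  return last;
}
```

With this shape a permission error costs exactly one attempt, so the background thread ends right away instead of running out the retry budget.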
The PR protects against multiple fetches for the same bucket (via in_flight_fetch_). However, there is no global limit on the total number of threads. RPC timeouts and retry policies do not solve the performance problems which come with constant thread creation/destruction.
Let's keep this comment open for now until we are confident with this approach. Will move this discussion offline.
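To make the concern concrete, here is a sketch that combines per-bucket deduplication with a global cap on concurrent fetches. `InFlightTracker`, `TryStart`, `Finish`, and the `max_in_flight` parameter are hypothetical; the PR's in_flight_fetch_ member may be structured differently, and the cap is not something the PR currently has.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <string>

// Hypothetical tracker for background metadata fetches.
class InFlightTracker {
 public:
  explicit InFlightTracker(std::size_t max_in_flight)
      : max_in_flight_(max_in_flight) {}

  // Returns true if the caller may start a fetch for `bucket`. Returns false
  // if one is already running for that bucket or the global cap is reached.
  bool TryStart(std::string const& bucket) {
    std::lock_guard<std::mutex> lk(mu_);
    if (in_flight_.size() >= max_in_flight_) return false;  // global cap
    return in_flight_.insert(bucket).second;  // false if already in flight
  }

  // Must be called once the fetch for `bucket` completes or fails.
  void Finish(std::string const& bucket) {
    std::lock_guard<std::mutex> lk(mu_);
    in_flight_.erase(bucket);
  }

 private:
  std::mutex mu_;
  std::size_t const max_in_flight_;
  std::set<std::string> in_flight_;
};
```

A rejected TryStart() would simply leave the span unenriched for that call; a later cache miss can try again.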
bg_tasks_.end());
}

void TracingConnection::EnrichSpan(opentelemetry::trace::Span& span,
We also need to provide an option to disable this feature.
I added a new option in options.h and kept it enabled by default.
@cpriti-os is the plan for all clients to keep this behavior on or off by default?
class BucketMetadataCache {
 public:
  explicit BucketMetadataCache(std::size_t max_size = 10000)
IIRC, there was some discussion to make this value configurable. Let's make sure this value/behavior is consistent across languages.
cc: @cpriti-os
@kalragauri I don't think we need to make this configurable. Reasonably low memory consumption should be enough, along with a flag to disable the feature altogether. Since this isn't exactly a user-facing feature, a lot of options and configurations for internal logic can be confusing for users.
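For context on the memory bound, a fixed-capacity cache can be sketched as below. The thread does not say how the PR's BucketMetadataCache evicts entries, so the least-recently-used policy here is an assumption, and `BoundedBucketCache` and `CachedBucket` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cached value; the real cache stores more bucket metadata.
struct CachedBucket {
  std::string location;
};

// Fixed-capacity cache with least-recently-used eviction (assumed policy).
class BoundedBucketCache {
 public:
  explicit BoundedBucketCache(std::size_t max_size = 10000)
      : max_size_(max_size) {
    assert(max_size_ > 0);
  }

  void Put(std::string const& bucket, CachedBucket info) {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = index_.find(bucket);
    if (it != index_.end()) {
      // Update in place and mark the entry as most recently used.
      it->second->second = std::move(info);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (order_.size() >= max_size_) {
      // Evict the least recently used entry to stay within the bound.
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(bucket, std::move(info));
    index_[bucket] = order_.begin();
  }

  // Returns true and fills `out` on a hit; a hit refreshes the entry.
  bool Get(std::string const& bucket, CachedBucket& out) {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = index_.find(bucket);
    if (it == index_.end()) return false;
    order_.splice(order_.begin(), order_, it->second);
    out = it->second->second;
    return true;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lk(mu_);
    return order_.size();
  }

 private:
  using Entry = std::pair<std::string, CachedBucket>;
  mutable std::mutex mu_;
  std::size_t const max_size_;
  std::list<Entry> order_;  // front is the most recently used entry
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With a fixed default such as 10000 entries, memory use is bounded by the entry size, which fits the preference for a single on/off flag rather than a tunable limit.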