I believe there could be an argument over interpretation. Although it's a mouthful, I think using the term "inverse probability" is helpful. I'm also in favor of calling it a lower bound--where a lower bound on inverse probability equates to an upper bound on probability. It's saying: "at the time of Extract on a context, we believed the sampling rate, a.k.a. inverse probability, was no less than the indicated value."
I say this because some sampling schemes are a bit speculative about what is kept--I'm thinking of reservoir sampling approaches. A Sampler would be required to re-evaluate its decision when new attributes are set.
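To illustrate why reservoir sampling makes the rate speculative, here is a minimal sketch of Algorithm R (function and variable names are mine, not from any OTel API): each kept item's final inclusion probability is k/n, but n isn't known until the stream ends, so any inverse probability recorded at keep-time can only be a lower bound on the final value.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: keep k items from a stream of unknown length.
    Each item's final inclusion probability is k/n, known only at the end."""
    rng = random.Random(seed)
    reservoir = []
    n = 0
    for item in stream:
        n += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            j = rng.randrange(n)  # uniform in 0..n-1
            if j < k:
                reservoir[j] = item
    # Final inverse probability (a.k.a. sampling rate) for each kept item.
    inverse_probability = n / k
    return reservoir, inverse_probability

kept, inv_p = reservoir_sample(range(1000), k=10)
# inv_p == 100.0 here; yet when the first k items were admitted, the
# rate in effect was 1.0 -- the value at decision time was a lower bound.
```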
My position is: (1) callers ought to be able to tell whether a span operation will have no effect without a lazy interface, (2) UpdateName should not exist, though SetName is OK, and (3) Sampler should be considered a "head" sampler.
The Sampler decision informs whether a SpanData will be built and processed. The span processors can all implement their own sampling designs after the decision is made to build a SpanData, and these will each be recorded with different sampling rates. It's in this setting that I consider the propagated sampling rate to be a lower bound--it's the result of a head-sampling decision to build a span or trace based on the initial conditions, whereas the span or trace could eventually be recorded with a higher sampling rate if it survives (through random chance) some sort of selection process.
@jmacd regarding my earlier call-out: for me it was less to do with sampling decisions and more about the ability to addLink after span creation, which I understood was removed due to sampling-related concerns.
Having said that, I do indeed have some views on sampling, so I'm happy to chip in some of my thoughts:
Some characteristics of samplers (in-house ones) that I have found useful are:
I'm very much not in the loop here, but I was reading something on OTel sampling that mentioned the troubles of coordination, and I was reminded of this WeChat overload control paper that I figured could be of interest as an approach: https://www.cs.columbia.edu/~ruigu/papers/socc18-final100.pdf
In there, they only need to align the hashing algorithm they pick for overload control, which assigns priority to queries on a per-user basis, to ensure that a given user's transactions work end-to-end across all of their microservices (3,000 services over 20,000 machines). There's a strong parallel with distributed tracing sampling: you want all traces everywhere to line up so that the full lineage of an operation is visible, and they want user transactions across thousands of services to succeed under heavy load without heavy coordination.
In a nutshell, their trick is to define 128 levels of user priority, assigned by time-bound hashing so that across time slices the priorities of various users change, ensuring that every user eventually gets service during the day. These are coupled with business priority rules (admin > sales > etc.), and each service checks its current load against a quick lookup to decide whether to keep or shed a request. Based only on locally observed load and predefined hash schedules, this lets them make decisions--without coordination--that tend to shed load consistently for all related flows of a given user or session, yielding successful end-to-end transactions. They also added a feedback system: a responding service that had to shed load feeds that decision back to the parent, which can then abort further downstream load.
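The mechanism above can be sketched roughly as follows. This is my own loose reading of the paper's idea, not its actual algorithm (the function names, the choice of SHA-256, and the threshold convention are all assumptions): a user's priority is a hash of their id and the current time slice, and a service admits a request iff that priority clears a locally-adjusted threshold, so all services shedding at the same level shed the same users with zero coordination.

```python
import hashlib

LEVELS = 128  # number of user-priority levels, as in the paper

def user_priority(user_id: str, time_slice: int) -> int:
    """Time-bound hash: stable within a time slice, reshuffled across
    slices, so every user eventually gets service during the day.
    Convention here: 0 is the highest priority."""
    digest = hashlib.sha256(f"{user_id}:{time_slice}".encode()).digest()
    return digest[0] % LEVELS

def admit(user_id: str, time_slice: int, admission_level: int) -> bool:
    """Keep the request iff its priority clears the local threshold.
    admission_level is raised/lowered from locally observed load; all
    of a user's requests in a slice get the same decision everywhere."""
    return user_priority(user_id, time_slice) <= admission_level

# Every replica evaluating the same (user, slice) reaches the same verdict,
# so a user's transactions either survive end-to-end or are shed everywhere.
consistent = admit("alice", 7, 60) == admit("alice", 7, 60)
```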