I believe there could be an argument over interpretation. Although it's a mouthful, I think using the term "inverse probability" is helpful. I'm also in favor of calling it a lower bound--where a lower bound on inverse probability equates with an upper bound on probability. It's saying that "at the time of Extract on a context, we believed the sampling rate a.k.a. inverse probability was no less than the indicated value."
I say this because some sampling schemes are a bit speculative about what is kept-- I'm thinking of reservoir sampling approaches.
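To make the reservoir case concrete, here is a minimal sketch of plain reservoir sampling (Algorithm R), not anything from an existing SDK. The function name `reservoir_sample` and the idea of storing the inverse probability next to each item are my own assumptions for illustration. The point it shows: the inverse probability known when an item is admitted is only a lower bound on the value that applies once the stream ends.

```python
import random


def reservoir_sample(items, k, rng):
    """Keep a uniform sample of k items from a stream (Algorithm R).

    Each kept item is paired with the inverse probability that applied
    at the moment it was admitted. That value can only grow as more
    items arrive, so it is a lower bound on the final one.
    """
    reservoir = []  # list of (item, inverse_probability_at_admission)
    seen = 0
    for item in items:
        seen += 1
        if len(reservoir) < k:
            # The first k items are always admitted (probability 1).
            reservoir.append((item, 1.0))
        else:
            # Item number `seen` is admitted with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = (item, seen / k)
    # Every survivor was ultimately kept with probability k / seen.
    final_inverse_probability = max(1.0, seen / k)
    return reservoir, final_inverse_probability
```

For a stream of 1000 items and `k = 10`, every survivor ends up with a final inverse probability of 100, while the value recorded at admission is anywhere between 1 and 100.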
Sampler required to re-evaluate its decision when new attributes are set.
My position is (1) that callers ought to be able to tell whether a span operation will have no effect w/o a lazy interface, (2) UpdateName should not exist, SetName is OK, (3) Sampler should be considered a "head" sampler.
The Sampler decision informs whether a SpanData will be built and processed. The span processors can all implement their own sampling designs after the decision is made to build a SpanData, and these will each be recorded with different sampling rates. It's in this setting that I consider the propagated sampling rate to be a lower bound--it's the result of a head-sampling decision to build a span or trace based on the initial conditions, whereas the span or trace could eventually be recorded with a higher sampling rate if it survives (through random chance) some sort of selection process.
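A rough sketch of that head-then-processor arrangement, under stated assumptions: `SpanData` here is a toy dataclass, not the SDK type, and `head_sample`, `processor_sample`, and `record_all` are hypothetical helpers I made up for illustration. Each stage that keeps a span with probability p multiplies its inverse probability by 1/p, so the recorded value never falls below the one the head sampler propagated.

```python
import random
from dataclasses import dataclass


@dataclass
class SpanData:
    name: str
    inverse_probability: float  # sampling rate in the sense used above


def head_sample(name, probability, rng):
    """Head decision from initial conditions only: build a SpanData or not."""
    if rng.random() < probability:
        return SpanData(name, 1.0 / probability)
    return None


def processor_sample(span, probability, rng):
    """A span processor's own later selection; it can only raise the rate."""
    if rng.random() < probability:
        return SpanData(span.name, span.inverse_probability / probability)
    return None


def record_all(n, head_p, processor_p, seed):
    """Run n spans through the head sampler and then one processor."""
    rng = random.Random(seed)
    recorded = []
    for i in range(n):
        span = head_sample(f"span-{i}", head_p, rng)
        if span is None:
            continue  # never built, so no processor ever sees it
        kept = processor_sample(span, processor_p, rng)
        if kept is not None:
            recorded.append(kept)
    return recorded
```

With `head_p = 0.5` and `processor_p = 0.25`, the propagated value is 2 and every recorded span carries 8, which is what I mean by the propagated rate being a lower bound.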
@jmacd regarding my earlier call-out, it was less to do with sampling decisions, and more about the ability to addLink after span creation for me, which I understood was removed due to sampling-related concerns.
Having said that, I do indeed have some views on sampling, so I'm happy to chip in some of my thoughts:
Some characteristics of (in-house) samplers that I have found useful are:
I'm very much not in the loop here, but I was reading on OTel sampling somewhere mentioning the troubles of coordination, and I was reminded of this WeChat overload control paper that I figured could be of interest as an approach: https://www.cs.columbia.edu/~ruigu/papers/socc18-final100.pdf
In there, they only need to align the hashing algorithm they pick for overload control to give a priority to queries on a kind of user basis to ensure that a given user's transactions work end-to-end across all of their microservices (3000 services over 20,000 machines). There's a huge parallel with distributed tracing sampling to be compared there -- you want all traces everywhere to line up and let a full lineage of an operation be visible, and they want user transactions across thousands of services to succeed under heavy load without heavy coordination.
In a nutshell, their trick was to define 128 levels of user priorities (which are assigned by time-bound hashing algorithms so that over time slices, the priorities of various users are changed, ensuring that eventually during the day a user gets service), couple them to business priority rules (admin > sales > etc.), and then they check the current overall load availability of the local service to do a quick lookup and know whether to keep or shed a thing. That lets them quickly, based on locally observed load and predefined hash schedules, make decisions that without coordination tend to shed load similarly for all related flows of a given user or session and give successful end-to-end transactions. They also added a feedback system where a responding service that had to shed load feeds that decision to the parent, which can then abort some further downstream load.
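Here is a loose sketch of how I read that scheme, not the paper's actual algorithm. The names `user_priority` and `admit`, the use of SHA-256, the "lower number means more important" ordering, and the way the threshold is represented are all my assumptions; only the 128 user levels and the time-bound hashing come from the description above. Each service is assumed to move its own threshold up or down from locally observed load.

```python
import hashlib

USER_LEVELS = 128  # number of user priority levels, per the description above


def user_priority(user_id, time_slice):
    """Hash (time slice, user) to a level in [0, 128).

    Including the time slice means a user's level changes over time.
    Assumption: level 0 is the most important.
    """
    digest = hashlib.sha256(f"{time_slice}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % USER_LEVELS


def admit(business_priority, user_id, time_slice, threshold):
    """Keep a request if its (business, user) priority is within threshold.

    `threshold` is a (business_level, user_level) pair that each service
    would tighten or relax based on its own load (not modeled here).
    """
    priority = (business_priority, user_priority(user_id, time_slice))
    return priority <= threshold
```

Because the level depends only on the user and the time slice, two services that land on a similar threshold make the same keep-or-shed call for that user without talking to each other, which is the property you'd want from trace sampling too.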
@awssandra It's Matt (MSFT), good to meet you today. Some questions about the X-Ray Spec (by Section):
1) Do you consider X-Ray "head-based", "tail-based" sampling, or something in between?
2) Does X-Ray impact throughput of the "entry service" (assuming no memory constraints)?
3) Is there a scenario where sampling at the Client would be useful (before data is sent)?
4) Do you have any requests to cap volume by default? Seems like "reservoir" + 5% rate could be expensive at high volume.
5) Does "Service type" only refer to the entry service? What if two different service types are in a dependency chain?
6) Do you have any usage data on these filters? For instance, is "URL Path" used enough to justify it?
7) How long does it take for sampling rules changes to take effect? Any customer feedback here?
8) Without the X-Ray SDK, is sampling unavailable? Do customers have other sampling options?
9) Can customers using 3rd-party monitoring vendors take advantage of X-Ray?