    Liz Fong-Jones
    @lizthegrey
    e.g. that if you're sampling based on endpoint
    yes, that's precisely correct, we need a standard baggage field
    Joshua MacDonald
    @jmacd
    OK, I would agree to that. :)
    Liz Fong-Jones
    @lizthegrey
    so that all otel understands how to read the sampling field even if generated by a different language SDK etc
    I was putting off discussing it as long as we were still mired in 0.3 land
    but for 0.4 I'd like to see us do it
    essentially I think a lot of the disconnect here is that vendors with their own metrics systems may not care much about sampling precision, whereas those of us who are sampling but reporting the multiplied-out sample totals wind up needing it to approximate the total number of RPCs etc
    Joshua MacDonald
    @jmacd

    I believe there could be an argument over interpretation. Although it's a mouthful, I think using the term "inverse probability" is helpful. I'm also in favor of calling it a lower bound--where a lower bound on inverse probability equates with an upper bound on probability. It's saying that "at the time of Extract on a context, we believed the sampling rate a.k.a. inverse probability was no less than the indicated value."

    I say this because some sampling schemes are a bit speculative about what is kept-- I'm thinking of reservoir sampling approaches.
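    The "inverse probability" idea above can be sketched in a few lines: a backend sums the propagated inverse probabilities of the spans it kept to approximate the true total. This is a minimal illustration (field names are made up, not the OTel data model):

```python
# Sketch: estimating true event totals from sampled spans, where each kept
# span carries the inverse probability (1 / sampling probability) it was
# sampled with. Field names are illustrative, not part of the OTel model.

def estimate_total(kept_spans):
    """Horvitz-Thompson style estimate: each kept span stands in for
    inverse_probability original events."""
    return sum(span["inverse_probability"] for span in kept_spans)

# Three spans survived sampling: two at 1-in-4, one at 1-in-2.
kept = [
    {"name": "GET /a", "inverse_probability": 4},
    {"name": "GET /b", "inverse_probability": 4},
    {"name": "GET /c", "inverse_probability": 2},
]
print(estimate_total(kept))  # -> 10
```

    If the propagated value is only a lower bound on the inverse probability, this estimate becomes a lower bound on the true total as well.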

    I have a second concern about sampling, which has to do with several loose ends in the Span API:
    • how can a caller tell whether a span is a no-op, or shall we recommend a lazy interface for any kind of deferment
    • shall "UpdateName" be a special case
    • is the Sampler required to re-evaluate its decision when new attributes are set?
    Joshua MacDonald
    @jmacd

    My position is (1) that callers ought to be able to tell whether a span operation will have no effect w/o a lazy interface, (2) UpdateName should not exist, SetName is OK, (3) Sampler should be considered a "head" sampler.

    The Sampler decision informs whether a SpanData will be built and processed. The span processors can all implement their own sampling designs after the decision is made to build a SpanData, and these will each be recorded with different sampling rates. It's in this setting that I consider the propagated sampling rate to be a lower bound--it's the result of a head-sampling decision to build a span or trace based on the initial conditions, whereas the span or trace could eventually be recorded with a higher sampling rate if it survives (through random chance) some sort of selection process.

    To firm this up, I'm suggesting that the default SDK should implement a head sampler, one that does not re-evaluate sampling decisions. The span processors and otel collector can implement tail sampling, and we can propagate a lower bound of sampling rate. The propagated lower bound value helps us limit the volume of trace data collection, whereas actual sampling rates are likely to be computed in the span processors, not in the Sampler.
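    A toy version of such a head sampler, in pure Python rather than the SDK's actual Sampler interface: the decision is a deterministic function of the trace ID alone, fixed at creation time and never re-evaluated as attributes change.

```python
import hashlib

# Toy head sampler: the keep/drop decision depends only on the trace ID,
# so it is made once and never re-evaluated. Illustrative only; the real
# SDK Sampler interface differs.

def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of traces."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 10_000) < ratio * 10_000

# The same trace always gets the same decision, regardless of later
# attribute updates -- there is nothing to re-evaluate.
assert head_sample("4bf92f3577b34da6", 0.25) == head_sample("4bf92f3577b34da6", 0.25)
```

    Tail samplers in the processors or collector can then keep a larger share of the spans that survive this head decision, which is why the propagated rate is only a lower bound.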
    Liz Fong-Jones
    @lizthegrey
    +++ yes, love it
    indeed, this is for head-sampling only.
    and we can do tail sampling later in collector, processors, or in your satellites/our refinery/etc.
    but you have to start somewhere to start cutting the bulk down
    glad we're violently in agreement here
    Joshua MacDonald
    @jmacd
    ((( If you let me talk much longer on this topic, we'll come to the paper Magic sets and other strange ways to implement logic programs. (SIGMOD, 1986) and it will be a great digression )))
    Evgeny Yakimov
    @eyjohn

    @jmacd regarding my earlier call-out, it was less to do with sampling decisions, and more about the ability to addLink after span creation for me, which I understood was removed due to sampling-related concerns.

    Having said that, I do indeed have some views on sampling so happy to chip-in some of my thoughts:

    Some characteristics that I have found useful of samplers (in-house based) are:

    1. Ability to influence sampling from application code (i.e. the application code can force sample)
    2. Late evaluation of sampling decision (for the addLink case)
    3. Very much pro Liz's proposal on sample_rate, although I was thinking that this is more of a semantic convention that vendors can adhere to, rather than a defined property on the data model
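    Point 1 could be sketched as a thin wrapper that lets application code force the decision, falling back to an inner sampler otherwise (attribute name and function are hypothetical, not an OTel interface):

```python
# Sketch of point 1: application code forces sampling via an attribute;
# otherwise the wrapped sampler's decision stands. Names are hypothetical.

def sample_with_override(attributes: dict, inner_decision: bool) -> bool:
    # Application code forces sampling by setting this attribute.
    if attributes.get("sampling.force") is True:
        return True
    return inner_decision

assert sample_with_override({"sampling.force": True}, False) is True
assert sample_with_override({}, False) is False
```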
    Fred Hebert
    @ferd

    I'm very much not in the loop here, but I was reading on OTel sampling somewhere mentioning the troubles of coordination, and I was reminded of this WeChat overload control paper that I figured could be of interest as an approach: https://www.cs.columbia.edu/~ruigu/papers/socc18-final100.pdf

    In there, they only need to align the hashing algorithm they pick for overload control to give a priority to queries on a kind of user basis to ensure that a given user's transactions work end-to-end across all of their microservices (3000 services over 20,000 machines). There's a huge parallel with distributed tracing sampling to be compared there -- you want all traces everywhere to line up and let a full lineage of an operation be visible, and they want user transactions across thousands of services to succeed under heavy load without heavy coordination.

    In a nutshell, their trick was to define 128 levels of user priorities (which are assigned by time-bound hashing algorithms so that over time slices, the priorities of various users are changed, ensuring that eventually during the day a user gets service), couple them to business priority rules (admin > sales > etc.), and then they check the current overall load availability of the local service to do a quick lookup and know whether to keep or shed a thing. That lets them quickly, based on locally observed load and predefined hash schedules, make decisions that without coordination tend to shed load similarly for all related flows of a given user or session and give successful end-to-end transactions. They also added a feedback system where a responding service that had to shed load feeds that decision to the parent, which can then abort some further downstream load.
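    The keep/shed rule described above can be sketched roughly like this (constants and names are assumptions based on the paper summary, not the paper's actual code):

```python
import hashlib

# Rough sketch of the scheme described above: a user's priority is a
# time-bucketed hash into 128 levels, and each service locally admits
# requests whose priority clears its current load-derived threshold.
# No cross-service coordination is needed: every service computes the
# same priority for the same user in the same time window.

LEVELS = 128

def user_priority(user_id: str, time_bucket: int) -> int:
    h = hashlib.sha256(f"{user_id}:{time_bucket}".encode()).hexdigest()
    return int(h, 16) % LEVELS

def admit(user_id: str, time_bucket: int, admitted_levels: int) -> bool:
    """admitted_levels is derived from local load: lower under pressure."""
    return user_priority(user_id, time_bucket) < admitted_levels

# Two services under equal load make the same decision for the same user.
p = user_priority("alice", 42)
assert admit("alice", 42, p + 1) and not admit("alice", 42, p)
```

    The parallel to trace sampling is that hashing the trace ID instead of the user ID gives the same coordination-free all-or-nothing behavior for a trace's spans.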

    as I said I'm out of the loop, but it sounded like something that could very much be relevant to sampling more or less consistently without coordination.
    Liz Fong-Jones
    @lizthegrey
    dropping a few links here that I had starred to respond to but didn't get to in the chaos of last month
    in the hopes that other people here pick them up &| debate them
    Joshua MacDonald
    @jmacd
    @lizthegrey I was briefly confused myself, reading the above about "sampling priority" because "priority sampling" is the name of a weighted sampling algorithm (see "Priority sampling for estimation of arbitrary subset sums").
    Liz Fong-Jones
    @lizthegrey
    reviving the conversations from February: we should probably actually get ready to progress on this issue soon now that things are less chaotic
    Sandra McMullen
    @awssandra
    Hello there! Sandra from AWS X-Ray here - Looks like we've started a Friday morning zoom meet on this topic - just wondering if anyone is attending this week (given the long weekend) - thanks!
    Ted Young
    @tedsuo
    Yes, actually I am going to be unavailable. Perhaps we should cancel this week? If others want to use the time slot and run the meeting, feel welcome.
    Sandra McMullen
    @awssandra
    Most of my team is unavailable
    Yusuke Tsutsumi
    @toumorokoshi
    yes agreed, cancelling is a good idea. Also I'm switching jobs, so meeting at 9am may be hard for me regardless. But I'll contribute where I can and be at the spec meetings!
    Ted Young
    @tedsuo
    Hi all, just a reminder that we have a sampling meeting in 10 min. If this time is no longer the best, maybe we can pick a different one?
    Also, is there anyone interested in sampling who has not been attending the meetings? If so, why not? I’d like to improve this process so that we can get sampling API resolved satisfactorily in the next several weeks.
    Joshua MacDonald
    @jmacd
    Sorry @tedsuo I have been mostly on PTO this week. I hope it's productive.
    Tristan Sloughter
    @tsloughter
    I'm interested in keeping up to date on the WG, but probably can't be making the meetings
    today been stuck in meetings...
    Matt McCleary
    @mattmccleary

    @awssandra It's Matt (MSFT), good to meet you today. Some questions about the X-Ray Spec (by Section):

    Workflow
    1) Do you consider X-Ray "head-based", "tail-based" sampling, or something in between?
    2) Does X-Ray impact throughput of the "entry service" (assuming no memory constraints)?
    3) Is there a scenario where sampling at the Client would be useful (before data is sent)?

    Sampling Rule
    4) Do you have any requests to cap volume by default? Seems like "reservoir" + 5% rate could be expensive with high-volume.
    5) Does "Service type" only refer to the entry service? What if two different service types are in a dependency chain?
    6) Do you have any usage data on these filters? For instance, is "URL Path" used enough to justify it?

    Work Modes
    7) How long does it take for sampling rules changes to take effect? Any customer feedback here?
    8) Without the X-Ray SDK, is sampling unavailable? Do customers have other sampling options?
    9) Can customers using 3rd-party monitoring vendors take advantage of X-Ray?

    Elizabeth Heinlein
    @newrelic-eheinlein
    Is today's meeting canceled? There are currently only two of us on the Zoom call.
    Sandra McMullen
    @awssandra
    Sorry I was late
    Jason Feingold
    @jifeingo

    I've been looking at the sampler for .NET and wanted to propose building the concept of an Aggregate Sampler (AndSampler, OrSampler). Here is a draft of an OTEP describing it. At a high level, it describes building a sampler that evaluates the results of multiple inner samplers to make a decision. I feel that it may make things easier in scenarios where multiple considerations/approaches go into making a sampling decision.

    https://github.com/jifeingo/oteps/blob/AggregateSamplers/0000-sampler-and.md

    Is this the right place to discuss something like this?
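    The aggregate idea could look something like this (an illustrative sketch of the concept, not the OTEP's actual API):

```python
# Illustrative sketch of aggregate samplers: combine the boolean results
# of several inner samplers with AND / OR semantics. Not the OTEP's API.

def and_sampler(*inner):
    return lambda span: all(s(span) for s in inner)

def or_sampler(*inner):
    return lambda span: any(s(span) for s in inner)

# Hypothetical inner samplers operating on a span-as-dict.
is_error = lambda span: span.get("status") == "ERROR"
is_checkout = lambda span: span.get("route") == "/checkout"

keep = or_sampler(is_error, is_checkout)
assert keep({"status": "ERROR"}) is True
assert keep({"route": "/checkout"}) is True
assert keep({"route": "/home", "status": "OK"}) is False
```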

    Liudmila Molkova
    @lmolkova
    hey, created issue to allow tracestate changes by sampler: open-telemetry/opentelemetry-specification#856. Please comment!
    Nizar Tyrewalla
    @ntyrewalla
    @lizthegrey is there an existing SIG in place for sampling strategy, would like to participate
    Przemek Maciolek
    @pmm-sumo
    It seems that currently when sampling happens, the information about sampling rate used or rule selected for tail-based filtering is not being stored anywhere. I was going through the specs and issues but haven't found anything. I am wondering if introducing several attributes such as sampling.rate or sampling.rule would make sense to fill that gap. Or maybe there's a reason to not include such information?
    Otmar Ertl
    @oertl

    Hi all, I'm new here, and I'm not sure if this is the right place for my concern or if it hasn't already been addressed elsewhere. With trace ID ratio based sampling, a trace can break into several parts if there is a sampler with a smaller ratio in the middle. You then know that all spans belong to the same trace, but it is not possible to determine which part is a child of another.

    For example, consider a trace consisting of 3 spans, A -> B -> C. The sampling ratio is 1 for A and C, while it is 0.01 for B. This means that only A and C are collected, while B is discarded with 99% probability. Since both will have the same trace ID, we just know they are part of the same trace. However, the parent span ID of C will be the span ID of B, which is not collected and therefore never known. Hence, it is not possible in the general case to conclude that span C is a (grand-)child of A. (In this particular case, we know that C must be a child of A, because A is the root span. In the general case, this reasoning is not possible though.)

    A possible solution would be forwarding the parent span ID in case a sampler decides not to sample a span. For that example, this means that the span ID of A instead of B would be received for C, and the span data of C would finally report span A as its parent. A further improvement would be to also count the number of ancestor spans that have not been sampled. If this counter is carried by the span context, incremented after every hop, and finally added to the span data, it would be easy to tell how many generations lie between a span and its recorded parent. In other words, we would know, for example, whether a span is a child, grand-child, or great-grandchild.
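    That propagation could be sketched like this (context field names are hypothetical):

```python
# Sketch of the proposal above: when a sampler drops a span, forward the
# last *sampled* ancestor's span ID and count the skipped generations,
# so a collected descendant can report its nearest recorded ancestor.
# Field names are hypothetical.

def propagate(ctx: dict, span_id: str, sampled: bool) -> dict:
    if sampled:
        # This span is recorded: it becomes the reference point.
        return {"last_sampled_ancestor": span_id, "skipped": 0}
    # Dropped: keep the previous ancestor and count the skipped hop.
    return {
        "last_sampled_ancestor": ctx["last_sampled_ancestor"],
        "skipped": ctx["skipped"] + 1,
    }

# A -> B -> C with B dropped: C sees A as its nearest recorded ancestor,
# one unsampled generation away.
ctx_a = propagate({"last_sampled_ancestor": None, "skipped": 0}, "A", True)
ctx_b = propagate(ctx_a, "B", False)
assert ctx_b == {"last_sampled_ancestor": "A", "skipped": 1}
```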

    Przemek Maciolek
    @pmm-sumo
    Hi @oertl , I believe this is not the case. Since all spans in your example share the same trace_id (and it's used to compute the hash), they will be consistently sampled (unless there are some collectors with different hash seed configured at the same tier)
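    The consistency property being relied on here (when all samplers hash the trace ID the same way, a trace kept at a low ratio is kept at every higher ratio) can be sketched as follows; this is an illustration, not the spec's exact algorithm:

```python
import hashlib

# Sketch of consistent trace-ID ratio sampling: every sampler hashes the
# trace ID identically and compares against a threshold proportional to
# its ratio, so decisions are nested across ratios. Illustrative only.

def keep(trace_id: str, ratio: float) -> bool:
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 1_000_000) < ratio * 1_000_000

tid = "4bf92f3577b34da6a3ce929d0e0e4736"
# If a trace clears the 1% threshold, it also clears every larger one.
for r_low, r_high in [(0.01, 0.5), (0.1, 1.0)]:
    if keep(tid, r_low):
        assert keep(tid, r_high)
```

    Note this makes the middle service's decision consistent, not guaranteed: in the A -> B -> C example, B at ratio 0.01 still drops whenever the trace's hash falls outside the smallest threshold.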
    Ted Young
    @tedsuo
    Hi all. Now that the v1.0 is finally out the door, I’m looking to get a weekly meeting rolling again. If you are interested, please respond to this issue with your availability: open-telemetry/community#632
    Khyati Gandhi
    @khyati612:matrix.org
    Hi Everyone!! I am new here, and not sure whether this is the correct platform for this question. I have a requirement to integrate a sampling processor which samples spans based on conditions, much like the tail-based sampling processor. But I came across many documents saying there is a plan to deprecate it. Does anyone know if there is a replacement available? Any pointers on customising it would be really helpful!
    Juraci Paixão Kröhling
    @jpkrohling
    probably a question for the OpenTelemetry Collector channel in the CNCF Slack, but I can answer it here, as you probably read that information based on something I wrote. You can certainly use the current tail-based sampling processor today. The plan to deprecate is mainly about splitting it into two, and even if we decide to go in this direction, the configuration will likely be similar between the solutions.
    here's an example of how the config might look for both scenarios:
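    (The original example wasn't captured here; for orientation, a representative tail_sampling processor config follows the processor's documented policy format and looks roughly like this -- an illustrative sketch, not the config that was shared.)

```yaml
processors:
  tail_sampling:
    # How long to buffer a trace before making the sampling decision.
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```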