    Gary Coady
    @fiadliel
    probability distribution function
    Timothy Perrett
    @timperrett
    ah ok
    :)
    @fiadliel i'm not aware of anything like that, we’ve done some things internally with post-processing the funnel data, but nothing that’s public right now.
    Gary Coady
    @fiadliel
    Okay - well other than that, maybe I’ll try to resurrect the Histogram type. It is quite hard to get a good feel for the distribution shape without further work on top of the raw numbers.
    Timothy Perrett
    @timperrett
    @fiadliel yeah that would be cool - i’d welcome that
    @fiadliel im curious, what are you using for monitoring right now?
    Gary Coady
    @fiadliel
    how I’ve seen it done before is either buckets with exponentially increasing latency (1ms, 2, 4, 8, 16, …, MAX) or rounded numbers (1, 2, 5, 10, 20, 50, 100, …, MAX), which are more human-readable. There’s also the obvious nice quality that these bucket counts can be aggregated across instances. But it’s a lot of extra time series.
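A minimal sketch of the two bucket layouts described above, in plain Scala; the object name, the cap value, and the helper are illustrative stand-ins, not funnel or prometheus APIs:

```scala
// Illustrative only - plain Scala, not funnel's API. Shows the two bucket
// layouts mentioned above: doubling boundaries vs. "round number" boundaries.
object LatencyBuckets {
  val maxMs = 60000L // assumed cap, standing in for "MAX" above

  // 1, 2, 4, 8, 16, ... ms, doubling up to the cap
  val exponential: Vector[Long] =
    Iterator.iterate(1L)(_ * 2).takeWhile(_ < maxMs).toVector :+ maxMs

  // 1, 2, 5, 10, 20, 50, 100, ... ms - easier for humans to read
  val rounded: Vector[Long] =
    Iterator.iterate(1L)(_ * 10).takeWhile(_ < maxMs)
      .flatMap(b => Vector(b, 2 * b, 5 * b)).toVector :+ maxMs

  // Map an observation to the upper bound of its bucket; per-bucket counters
  // keyed on these bounds can simply be summed across instances.
  def bucketFor(latencyMs: Long, bounds: Vector[Long]): Long =
    bounds.find(latencyMs <= _).getOrElse(maxMs)
}
```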
    I hate to even say it, we’ve historically mostly used NewRelic, I won’t say any more :-( :-( There is also some monitoring with InfluxDB, and AWS CloudWatch. We’re not using Prometheus, but I’ve looked at it a tiny bit, it’s not as clean as funnel, but then not many services are :-)
    Timothy Perrett
    @timperrett
    @fiadliel i’ve been thinking about what could be done to improve funnel, and probably the largest issue thus far is that we need to optimise the number of streams we end up with - the stream per metric can sometimes become a bit burdensome, so i’d be open to suggestions for a different, more optimal design that makes adding many more vectors easier.
    @fiadliel funny you mention prometheus - they have a similar design in some sense (at the wire layer, with how services connect and pull), but they don’t have the orchestrator. Sometimes i wonder if implementing their protocol would be useful just to get the higher-performance collectors of theirs.
    Gary Coady
    @fiadliel
    It might be worth it just to gain more cross-platform abilities; if people want “one monitoring platform to rule them all”, they need support for more than Scala.
    funnel for its client API + prometheus for the server infrastructure mightn’t be so bad
    Gary Coady
    @fiadliel
    my impression is that an optimized time-series DB will work a lot better over time than ElasticSearch, but I’ve actually never used ElasticSearch at scale for this purpose.
    Timothy Perrett
    @timperrett
    @fiadliel yeah ES was essentially a stop-gap for us
    @fiadliel we’ve made it scale, but it was not without its problems - there are so many operational nuances
    @fiadliel going forward internally we’ll be pumping all our data down our internal pipeline, and ultimately putting that stuff in a time-series DB
    writing new output modules is easy though
    if you need one for influx etc
    but yeah, i’m with you on getting support for more than scala - kinda feel like changing the internal funnel model and writing out the prometheus format might be the way to go - then we could get chemist to do partition management of the various prometheus collectors
    just spit-balling, of course
    but it would be good to take advantage of what’s changed in the ecosystem in the past few years
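For reference, a rough sketch of what “writing out the prometheus format” could mean: rendering counters in the Prometheus text exposition format. The exposition format itself is real; the Counter case class and the metric name below are made-up stand-ins, not funnel’s actual model:

```scala
// Illustrative only: render a handful of counters in the Prometheus text
// exposition format. The format is real; Counter and the metric below are
// hypothetical stand-ins for whatever funnel's internal model would expose.
final case class Counter(name: String, labels: Map[String, String], value: Double)

def toPrometheusText(counters: Seq[Counter]): String =
  counters.map { c =>
    val labelStr =
      if (c.labels.isEmpty) ""
      else c.labels.map { case (k, v) => s"""$k="$v"""" }.mkString("{", ",", "}")
    s"# TYPE ${c.name} counter\n${c.name}$labelStr ${c.value}"
  }.mkString("\n")

// toPrometheusText(Seq(Counter("http_requests_total", Map("code" -> "200"), 1027)))
// yields:
//   # TYPE http_requests_total counter
//   http_requests_total{code="200"} 1027.0
```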
    Gary Coady
    @fiadliel
    I’ll play around with some ideas myself — we do need something reasonable, and prometheus/funnel both seem to have parts of the solution.
    Timothy Perrett
    @timperrett
    @fiadliel agreed - we only did any of this stuff because we were not satisfied with the alternatives. I think we have a partial solution - it’s certainly no panacea. If you’re looking at this for your commercial work, then i’d love to collaborate on changes in design etc and getting something that works for both companies :)
    Gary Coady
    @fiadliel
    Monitoring is just hard :-) Lots of writes, and generally a large active dataset for reading.
    Timothy Perrett
    @timperrett
    myself and @djspiewak have discussed alternative designs recently for changes to the internals, but it’s non-trivial.
    yep exactly
    i’d really like to get a multi-vector model - what we have is probably a little too constrained
    Gary Coady
    @fiadliel
    like the labels prometheus has — powerful but complex to understand, though still useful for pivoting across different views on your data.
    Timothy Perrett
    @timperrett
    yeah - i’m torn on its power. There are cases where i wish we had that, but the simplicity of what we currently do also has its allure.
    i like simple solutions
    or at least, the minimally powerful thing
    Gary Coady
    @fiadliel
    when it comes to monitoring though, sometimes it’s a badly behaving user (show latency by user), sometimes it’s a bad network switch (show latency by rack), sometimes a machine has a bad NIC, perhaps your new software revision is bad (divide by software version). So it can be useful.
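A small illustration of that pivoting idea, using plain Scala collections and hypothetical label names rather than any real client library:

```scala
// Illustrative only: one observation carries several labels, and the same
// stream can be re-aggregated along whichever dimension you need to inspect.
final case class Observation(latencyMs: Long, labels: Map[String, String])

// e.g. labels = Map("user" -> "u123", "rack" -> "r7", "version" -> "1.4.2")
def meanLatencyBy(obs: Seq[Observation], label: String): Map[String, Double] =
  obs.filter(_.labels.contains(label))
    .groupBy(_.labels(label))
    .view.mapValues(os => os.map(_.latencyMs).sum.toDouble / os.size)
    .toMap

// meanLatencyBy(obs, "rack")    -> spot the bad switch or NIC
// meanLatencyBy(obs, "version") -> spot the bad software revision
```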
    Timothy Perrett
    @timperrett
    @fiadliel oh im totally with you
    Gary Coady
    @fiadliel
    But last time I used a monitoring system that powerful, one person on our team basically wrote monitoring rules full-time ;-)
    Timothy Perrett
    @timperrett
    @fiadliel essentially the system needs to be powerful enough to do what’s necessary, but simple enough to use with a few basic conventions, and the system should absolutely plan for abuse by the devs
    so having ways to cap usage or throughput is key
    on the basis that the system WILL be abused
    Gary Coady
    @fiadliel
    if you are in charge of pulling the data, the only danger there is the number of time series, and with labels you get a multiplicative effect. So probably some kind of quota would be useful.
    Then there’s also a cost on querying.
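A toy sketch of that multiplicative effect and a naive series quota; the numbers and the quota mechanism are assumptions for illustration, not anything funnel or prometheus actually does:

```scala
// Illustrative only: label cardinalities multiply, so a crude per-metric
// quota on distinct series is one way to cap the blow-up at pull time.
val labelCardinalities = Map( // assumed counts, purely for illustration
  "endpoint" -> 50L,
  "status"   -> 5L,
  "version"  -> 3L
)

val worstCaseSeries: Long = labelCardinalities.values.product // 50 * 5 * 3 = 750

val seriesQuota = 500L

if (worstCaseSeries > seriesQuota)
  println(s"refusing to collect: $worstCaseSeries series exceeds quota of $seriesQuota")
```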
    Timothy Perrett
    @timperrett
    yup yup
    nice to chat with someone who also appreciates the complexities of monitoring :)
    Gary Coady
    @fiadliel
    I was an SRE with Google at one point ;)
    (eh, ops / site reliability engineer)
    I have to say, 1-second granularity on latency split across many dimensions is completely awesome, but very expensive!
    I’ll have a think about things anyway; thanks for the chat
    Timothy Perrett
    @timperrett
    np
    Guillaume Massé
    @MasseGuillaume
    hehe I see lots of action in the package index from this project
    Alex Henning Johannessen
    @ahjohannessen
    @MasseGuillaume I think it is an issue for all oncue projects on bintray, same thing for remotely artifacts. Could someone with credentials please look into this? /cc @timperrett @stew @runarorama