P. Oscar Boykin
@johnynek
@kailuowang
I'll look at them.
Kai(luo) Wang
@kailuowang
I created a ticket to track / facilitate discussion on this issue
typelevel/cats#1879
Usman Ijaz
@uijaz59_twitter

Hi,

I am new to these algorithms and curious about the differences between Sliding HyperLogLog (https://hal.archives-ouvertes.fr/hal-00465313/file/sliding_HyperLogLog.pdf) and HyperLogLog Series. I want to create thousands of counters that provide sliding-window cardinality estimates, e.g. estimates for the last 30 days, the last 7 days and the last 24 hours.

  • Will the HyperLogLog Series evict/forget older data, in my case data older than 30 days?
  • Does the size of a HyperLogLog Series increase with time? For a 12-bit counter, what would be the minimum and maximum size?

I am trying to find the answers to these questions, and a quick response would be really helpful.

Thanks.

Shumon Madzhumder
@shumn

Hey, I have a case class Thing(name: String). I need to "reduce" a Set[Thing] into a Set[Thing] where the resulting set keeps only the elements whose name has the maximum count. That is,

Set(Thing("Cory"), Thing("Cory"), Thing("Ahmad"), Thing("Kevin"), Thing("Kevin")) "reduces" to
Set(Thing("Cory"), Thing("Cory"), Thing("Kevin"), Thing("Kevin")).

How do I neatly put this into one of the structures defined in algebird?

An empty set won't reduce to anything in my case, so there is no identity element here. But it is also not evident to me how this fits a semigroup, since the result of the reduction is "many", not "one". Max looked promising at first, but I still don't see how to leverage it.
Shumon Madzhumder
@shumn
So it should normally reduce to one thing; if the counts are tied, it will reduce to many. Thing has more fields than just name; I kept it simple for this example.
Shumon Madzhumder
@shumn
change Set to List
P. Oscar Boykin
@johnynek
this seems related to a topk type problem
I'm not 100% sure what you want is actually associative... which means it may not be a semigroup/monoid
you can write a Fold, which is more general, but sorry I don't immediately see an answer to your problem
i think representing as Map[K, Long] where you keep track of the counts, might make it clearer
but pruning that is not associative.
we have cms with topk, but it is only approximately associative
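
A minimal sketch of the Map[K, Long] idea above, in plain Scala with no Algebird dependency. The object name MaxCountGroups and the helpers counts and keepMostFrequent are hypothetical names chosen for illustration. Counting per name can be merged in any order; the pruning is applied once at the end, because it is the step that is not associative.

```scala
object MaxCountGroups {
  final case class Thing(name: String)

  // Associative part: per-name counts. Maps of counts can be merged
  // key by key in any grouping, so this step behaves like a monoid.
  def counts(things: List[Thing]): Map[String, Long] =
    things.foldLeft(Map.empty[String, Long]) { (acc, t) =>
      acc.updated(t.name, acc.getOrElse(t.name, 0L) + 1L)
    }

  // Non-associative part: pruning. Done once, after all counts are known.
  // Keeps every Thing whose name reaches the maximum count (ties keep all).
  def keepMostFrequent(things: List[Thing]): List[Thing] =
    if (things.isEmpty) Nil
    else {
      val c = counts(things)
      val top = c.values.max
      things.filter(t => c(t.name) == top)
    }
}
```

Pruning partial results early would discard names that might later overtake the current leader, which is why it is kept out of the merge step here.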
Shumon Madzhumder
@shumn
OK. I realize what I am asking may not even be well-defined; I just needed some validation of that. I can always do a groupBy, I was just wondering if there is an abstract algebraic construct for this.
P. Oscar Boykin
@johnynek
not that immediately comes to mind
it is related to a count min sketch
Shumon Madzhumder
@shumn
OK, I will look these up, i.e. count-min sketch and TopK.
P. Oscar Boykin
@johnynek
:thumbsup:
Vaibhav Tulsyan
@xennygrimmato
Hello, I wish to contribute to algebird. Is there some beginner issue that needs to be fixed? I can start off with something small. I would appreciate some guidance from the maintainers of the project. Thanks! :)
P. Oscar Boykin
@johnynek
Vaibhav Tulsyan
@xennygrimmato
Thanks @johnynek. I'm assuming all these issues still require work.
Can you please give me some background on this issue - twitter/algebird#326
I'll try to understand what the issues with the test are.
P. Oscar Boykin
@johnynek
@xennygrimmato I think the link was to a different line when I made it, and now commits have changed what is on that line. I think I was pointing here: https://github.com/twitter/algebird/blob/develop/algebird-test/src/test/scala/com/twitter/algebird/HyperLogLogTest.scala#L57 Basically, all the random HLLs we create have a single element in them. That is not great. Really, we should generate a list of longs, for instance, and add all of them to the HLL. That would be a better test. I expect the tests to still pass, but it is something we should think about: high-quality random generation with good coverage.
Vaibhav Tulsyan
@xennygrimmato
@johnynek OK, that makes sense. Let me think of a good test case for this then. I'll discuss my approach on the issue itself, if you prefer that. I can create a PR after that.
Mateusz Fedoryszak
@matfed
Hello all! I'm looking for an implicit class that adds Scalding-style sumByKey/aggregateByKey to ordinary Scala collections. Is there such a thing in Algebird? If not, I'd be happy to create a pull request.
P. Oscar Boykin
@johnynek
@matfed there is this: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/MapAlgebra.scala#L196 we could do with a syntax enrichment package probably. That would be nice to add.
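
A rough sketch of what such a syntax enrichment could look like, written in plain Scala so it runs without Algebird. The names SumByKeySyntax, Plus and PairOps are hypothetical; Plus is only a stand-in for a semigroup type class and is not Algebird's Semigroup, and this is not the code behind the MapAlgebra link.

```scala
object SumByKeySyntax {
  // Hypothetical stand-in for a semigroup: an associative way to combine two values.
  trait Plus[V] { def plus(a: V, b: V): V }

  object Plus {
    implicit val longPlus: Plus[Long] = new Plus[Long] {
      def plus(a: Long, b: Long): Long = a + b
    }
    implicit val intPlus: Plus[Int] = new Plus[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
  }

  // Enrichment: adds sumByKey to any Iterable of key/value pairs.
  implicit class PairOps[K, V](pairs: Iterable[(K, V)]) {
    def sumByKey(implicit p: Plus[V]): Map[K, V] =
      pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
        // First value for a key is stored as-is; later values are combined.
        acc.updated(k, acc.get(k).fold(v)(prev => p.plus(prev, v)))
      }
  }
}
```

With `import SumByKeySyntax._` in scope, `List("a" -> 1L, "b" -> 2L, "a" -> 3L).sumByKey` groups the pairs by key and combines the values for each key.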
Vaibhav Tulsyan
@xennygrimmato
@johnynek Is there some documentation for the SparseHLL case class? I want to understand what maxRhow represents here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L393
Specifically, I want to know the use of Max[Byte] there.
Marco Didonna
@noiano
hello
I was wondering how you can serialize a Bloom Filter instance,
you know, to write it out somewhere,
then retrieve it and deserialize it as BF[String], for example
P. Oscar Boykin
@johnynek
@noiano there isn't code in algebird to do this, but you can pattern match on BF and serialize each case.
Kryo would probably work fine for it... the basic idea is just serialize the indices of all the bits.
(that are set)
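
A minimal sketch of the "serialize the indices of the set bits" idea, using only java.io. The Bits case class is a hypothetical stand-in for a Bloom filter's state (number of hashes, width, set bit indices); it is not Algebird's BF type, and mapping a real BF to and from it is left as an assumption.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object BloomBitsCodec {
  // Hypothetical stand-in for Bloom filter state, not an Algebird class.
  final case class Bits(numHashes: Int, width: Int, setBits: Set[Int])

  // Layout: numHashes, width, number of set bits, then each set index.
  def toBytes(b: Bits): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(b.numHashes)
    out.writeInt(b.width)
    out.writeInt(b.setBits.size)
    b.setBits.toList.sorted.foreach(i => out.writeInt(i))
    out.flush()
    buffer.toByteArray
  }

  // Reads the fields back in the same order they were written.
  def fromBytes(bytes: Array[Byte]): Bits = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val numHashes = in.readInt()
    val width = in.readInt()
    val n = in.readInt()
    val indices = List.fill(n)(in.readInt()).toSet
    Bits(numHashes, width, indices)
  }
}
```

Writing only the set indices keeps sparse filters small; for a densely filled filter, writing the raw bit array would likely be more compact.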
mohnishkodnani
@mohnishkodnani
Hi,
I have a requirement to serialize the sketchmap at periodic intervals so that, if the process crashes, the state can be regenerated from the snapshot. The application reads from Kafka and updates the sketchmap.
Any ideas on how to serialize it?
P. Oscar Boykin
@johnynek
@mohnishkodnani to serialize a sketchmap you need a serializer for the key types and the value types.
well... actually that's not true... only the value type.
then you just iterate through the table and serialize each of the Vs
if the receiver knows the table dimensions, you just do that loop on the receiving side
if you don't know the size (which I would recommend you do statically know in your app) you can write the dimensions of the table first, then all the V values
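
A small sketch of the "dimensions first, then all the V values" layout, using only java.io. The table is modeled as a plain Vector[Vector[V]], which is an assumption for illustration and not the sketchmap's actual internal type; TableCodec, writeV and readV are hypothetical names, with the value serializer passed in by the caller.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object TableCodec {
  // Writes rows, cols, then every value in row-major order.
  def write[V](table: Vector[Vector[V]])(writeV: (DataOutputStream, V) => Unit): Array[Byte] = {
    val rows = table.length
    val cols = if (rows == 0) 0 else table.head.length
    require(table.forall(_.length == cols), "table must be rectangular")
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(rows)
    out.writeInt(cols)
    for (row <- table; v <- row) writeV(out, v)
    out.flush()
    buffer.toByteArray
  }

  // Reads the dimensions, then runs the same loop on the receiving side.
  def read[V](bytes: Array[Byte])(readV: DataInputStream => V): Vector[Vector[V]] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val rows = in.readInt()
    val cols = in.readInt()
    Vector.fill(rows, cols)(readV(in))
  }
}
```

For Long values, a round trip could look like `TableCodec.read(TableCodec.write(table)((o, v) => o.writeLong(v)))(_.readLong())`. If the dimensions are known statically, the two header ints can be dropped.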