P. Oscar Boykin
@johnynek
you can write a Fold, which is more general, but sorry I don't immediately see an answer to your problem
i think representing as Map[K, Long] where you keep track of the counts, might make it clearer
but pruning that is not associative.
we have cms with topk, but it is only approximately associative
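A minimal sketch of the point about Map[K, Long], in plain Scala with no Algebird dependency. The names CountMerge, plus, prune and prunedPlus are made up for illustration. Adding counts per key is associative, but once each merge is followed by a prune to the top k keys, the grouping of the merges can change the result.

```scala
object CountMerge {
  // Merge two count maps by adding the counts of each key. This is associative.
  def plus[K](a: Map[K, Long], b: Map[K, Long]): Map[K, Long] =
    b.foldLeft(a) { case (acc, (key, n)) => acc.updated(key, acc.getOrElse(key, 0L) + n) }

  // Keep only the k keys with the largest counts (ties are broken arbitrarily).
  def prune[K](m: Map[K, Long], k: Int): Map[K, Long] =
    m.toSeq.sortBy { case (_, n) => -n }.take(k).toMap

  // Merge, then prune. This combined operation is NOT associative.
  def prunedPlus[K](k: Int, a: Map[K, Long], b: Map[K, Long]): Map[K, Long] =
    prune(plus(a, b), k)
}
```

With a = Map("x" -> 3L), b = Map("y" -> 2L), c = Map("y" -> 2L) and k = 1, grouping as (a, b) first keeps "x", while grouping as (b, c) first keeps "y" with a count of 4.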
Shumon Madzhumder
@shumn
Ok. I realize what I am asking may not even be correct. Just needed some validation of this fact. I can always do a groupBy; just wondering if there is an abstract algebraic construct for this.
P. Oscar Boykin
@johnynek
not that immediately comes to mind
it is related to a count min sketch
Shumon Madzhumder
@shumn
OK, I will look these up, i.e. count-min sketch and TopK.
P. Oscar Boykin
@johnynek
:thumbsup:
Vaibhav Tulsyan
@xennygrimmato
Hello, I wish to contribute to algebird. Is there some beginner issue that needs to be fixed? I can start off with something small. I would appreciate some guidance from the maintainers of the project. Thanks! :)
P. Oscar Boykin
@johnynek
Vaibhav Tulsyan
@xennygrimmato
Thanks @johnynek. I'm assuming all these issues still require work.
Can you please give me some background on this issue - twitter/algebird#326
I'll try to understand what the issues with the test are.
P. Oscar Boykin
@johnynek
@xennygrimmato I think the link was to a different line when I made it, and now commits have changed what is on that line. I think I was pointing here: https://github.com/twitter/algebird/blob/develop/algebird-test/src/test/scala/com/twitter/algebird/HyperLogLogTest.scala#L57 Basically, all the random HLLs we create have a single element in them. That is not great. Really, we should generate a list of longs, for instance, and add all of them to the HLL. That would be a better test. I expect the tests to still pass, but it is something we should think about: high quality and good coverage random generation.
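A rough sketch of the generator shape described above, assuming a toy stand-in rather than Algebird's real HLL type or the ScalaCheck generators used in the test suite. ToyHll, mix, single, fromLongs and randomHll are all hypothetical names. The idea is only that a random instance is built from a whole list of longs instead of one element.

```scala
import scala.util.Random

object HllGenSketch {
  val bits = 4 // 2^4 = 16 registers; an arbitrary size for the sketch

  // Toy stand-in for an HLL: one max-register per bucket, merged by pointwise max.
  final case class ToyHll(registers: Vector[Byte]) {
    def +(that: ToyHll): ToyHll =
      ToyHll(registers.zip(that.registers).map { case (a, b) => if (a >= b) a else b })
  }

  val empty: ToyHll = ToyHll(Vector.fill(1 << bits)(0.toByte))

  // 64-bit mixing function (splitmix64 finalizer), used here as the hash.
  private def mix(x: Long): Long = {
    var z = x
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL
    z ^ (z >>> 31)
  }

  // The old style of test input: a sketch holding exactly one element.
  def single(x: Long): ToyHll = {
    val h = mix(x)
    val bucket = (h >>> (64 - bits)).toInt
    val rho = math.min(java.lang.Long.numberOfLeadingZeros(h << bits) + 1, 64 - bits + 1)
    ToyHll(empty.registers.updated(bucket, rho.toByte))
  }

  // The suggested style: add every element of a list to the sketch.
  def fromLongs(xs: Seq[Long]): ToyHll =
    xs.map(single).foldLeft(empty)(_ + _)

  // Random instance built from 0 to maxSize random longs.
  def randomHll(rng: Random, maxSize: Int): ToyHll =
    fromLongs(List.fill(rng.nextInt(maxSize + 1))(rng.nextLong()))
}
```

Instances generated this way can fill many registers, so law checks such as associativity and commutativity of + exercise more than the single-register case.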
Vaibhav Tulsyan
@xennygrimmato
@johnynek Ok, that makes sense. Let me think of a good test case for this then. I'll discuss my approach on the issue itself, would you prefer that? I can create a PR after that
Mateusz Fedoryszak
@matfed
Hello all! I'm looking for an implicit class that adds Scalding-style sumByKey/aggregateByKey to ordinary Scala collections. Is there such a thing in Algebird? If not, I'd be happy to create a pull request.
P. Oscar Boykin
@johnynek
@matfed there is this: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/MapAlgebra.scala#L196 we could do with a syntax enrichment package probably. That would be nice to add.
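For illustration, a syntax enrichment of this kind might look roughly like the following. This is a sketch in plain Scala, not Algebird code: KeyedSyntax, Plus and KeyedOps are hypothetical names, and Plus is a minimal stand-in for a semigroup type class so the example runs without dependencies.

```scala
object KeyedSyntax {
  // Hypothetical minimal semigroup, standing in for a real type class.
  trait Plus[V] { def plus(a: V, b: V): V }

  object Plus {
    implicit val longPlus: Plus[Long] = new Plus[Long] {
      def plus(a: Long, b: Long): Long = a + b
    }
    implicit val stringPlus: Plus[String] = new Plus[String] {
      def plus(a: String, b: String): String = a + b
    }
  }

  // Adds sumByKey to any Iterable of key/value pairs.
  implicit class KeyedOps[K, V](pairs: Iterable[(K, V)]) {
    // Combine the values of each key, left to right in iteration order.
    def sumByKey(implicit p: Plus[V]): Map[K, V] =
      pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(v)(p.plus(_, v)))
      }
  }
}
```

After import KeyedSyntax._, a call such as List("a" -> 1L, "b" -> 2L, "a" -> 3L).sumByKey gives Map("a" -> 4L, "b" -> 2L).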
Vaibhav Tulsyan
@xennygrimmato
@johnynek Is there some documentation for the SparseHLL case class? I want to understand what maxRhow represents here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L393
Specifically, I want to know the use of Max[Byte] there.
Marco Didonna
@noiano
hello
I was wondering how you can serialize a Bloom filter instance
you know to write it out somewhere
then retrieve it and deserialize as BF[String] for example
P. Oscar Boykin
@johnynek
@noiano there isn't code in algebird to do this, but you can pattern match on BF and serialize each case.
Kryo would probably work fine for it... the basic idea is just serialize the indices of all the bits.
(that are set)
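A minimal sketch of the "serialize the indices of the set bits" idea, assuming a toy stand-in for the filter state. ToyBloom and BitIndexCodec are hypothetical names, and the real BF cases in Algebird are not modelled here. For each case of the real type you would pattern match and write out the equivalent fields.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.util.BitSet

object BitIndexCodec {
  // Hypothetical stand-in for a Bloom filter's state: hash count, width, set bits.
  final case class ToyBloom(numHashes: Int, width: Int, bits: BitSet)

  def write(bf: ToyBloom): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(bf.numHashes)
    out.writeInt(bf.width)
    out.writeInt(bf.bits.cardinality()) // how many indices follow
    var i = bf.bits.nextSetBit(0)
    while (i >= 0) {                    // write the index of every set bit
      out.writeInt(i)
      i = bf.bits.nextSetBit(i + 1)
    }
    out.flush()
    bytes.toByteArray
  }

  def read(data: Array[Byte]): ToyBloom = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val numHashes = in.readInt()
    val width = in.readInt()
    val count = in.readInt()
    val bits = new BitSet(width)
    (0 until count).foreach(_ => bits.set(in.readInt()))
    ToyBloom(numHashes, width, bits)
  }
}
```

Writing indices is compact while the filter is sparse; for a dense filter, writing the raw bit array would likely be smaller.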
mohnishkodnani
@mohnishkodnani
Hi,
I have a requirement to serialize the sketchmap at periodic intervals and then, if the process crashes, regenerate the state from this snapshot. The application reads from Kafka and updates the sketchmap.
Any ideas on how to serialize it?
P. Oscar Boykin
@johnynek
@mohnishkodnani to serialize a sketchmap you need serializer for the key types and the value types.
well... actually that's not true... only the value type.
then you just iterate through the table and serialize each of the Vs
if the receiver knows the table dimensions, you just do that loop on the receiving side
if you don't know the size (which I would recommend you do statically know in your app) you can write the dimensions of the table first, then all the V values
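The approach above, dimensions first and then every V, could be sketched like this. The table here is a plain Vector[Vector[V]] and TableCodec is a hypothetical name; the real sketch map's internal table type is not modelled, and the caller supplies the value serializer.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object TableCodec {
  // Write depth and width, then every value in row-major order.
  def write[V](table: Vector[Vector[V]], out: DataOutputStream)(
      writeV: (DataOutputStream, V) => Unit): Unit = {
    val depth = table.length
    val width = if (depth == 0) 0 else table.head.length
    out.writeInt(depth)
    out.writeInt(width)
    for (row <- table; v <- row) writeV(out, v)
  }

  // Read the dimensions back, then run the same loop on the receiving side.
  def read[V](in: DataInputStream)(readV: DataInputStream => V): Vector[Vector[V]] = {
    val depth = in.readInt()
    val width = in.readInt()
    Vector.fill(depth)(Vector.fill(width)(readV(in)))
  }

  // Convenience round trip for Long values, as a usage example.
  def roundTripLongs(table: Vector[Vector[Long]]): Vector[Vector[Long]] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    write(table, out)((o, v) => o.writeLong(v))
    out.flush()
    read[Long](new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))(_.readLong())
  }
}
```

If the dimensions are known statically in the app, the two writeInt/readInt calls can be dropped and the loops run over the fixed sizes.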