These are chat archives for twitter/summingbird

30th
Sep 2016
Pankaj Gupta
@pankajroark
Sep 30 2016 00:10
Hi, great to see a room for summingbird.
P. Oscar Boykin
@johnynek
Sep 30 2016 00:10
hi
Pankaj Gupta
@pankajroark
Sep 30 2016 00:17
I’ve pinged heron team to confirm support for apache storm namespace
P. Oscar Boykin
@johnynek
Sep 30 2016 00:17
ok, cool.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:19
we’ll likely need to make changes on our end as well to incorporate OSS Heron. The namespace part should simple I hope.
Do you foresee any difficulties?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:19
no. I think it should be find-and-replace.
At stripe, they did not follow the recommendations and stored kryo serialized objects.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:20
:(
that’s very tricky
P. Oscar Boykin
@johnynek
Sep 30 2016 00:20
yeah. That’s going to be a major pain.
I think we need to first migrate off that.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:20
which version of Kryo is Stripe on?
2.21?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:20
then adopt the updated summingbird/storm
yeah
whatever is in summingbird master
Pankaj Gupta
@pankajroark
Sep 30 2016 00:21
yeah
there’s even 4.0 now :)
P. Oscar Boykin
@johnynek
Sep 30 2016 00:21
well, that’s what we get with no classpath isolation
storm picks a version and we have to share it, if we use it.
we could actually stop using kryo in summingbird...
Pankaj Gupta
@pankajroark
Sep 30 2016 00:22
really?
what would we replace it with?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:22
what if we moved to the scalding serialization macros.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:22
I’m not familiar with them, do they support any arbitrary object?
do they support version evolution?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:23
you don’t need version evolution for data on the wire.
this would not be for long term. This is between bolts.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:23
yes, but what if people end up using that for writing to stores, without realizing?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:23
that’s a bug right?
Pankaj Gupta
@pankajroark
Sep 30 2016 00:24
what I’m trying to say is how is that better than kryo?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:24
they shouldn’t do this.
oh, it is just not tied to storm’s classpath
the reason we are tied to a version is because storm also uses kryo
Pankaj Gupta
@pankajroark
Sep 30 2016 00:24
I see
P. Oscar Boykin
@johnynek
Sep 30 2016 00:24
and they set a version. if you set the wrong one, you will have binary errors.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:24
yeah, that would be difficult
how is the classpath isolation typically done?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:25
also, the other win is that you get a compile time check that serialization works. No runtime errors.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:25
yeah, that would be great, serialization errors are a big pain
P. Oscar Boykin
@johnynek
Sep 30 2016 00:25
you have classloaders or you do shading
honestly, I worry about these projects...
they get weighted down by twitter’s software approach.
like even bugs are hard to fix because no one wants a change in your mono-repo.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:26
the classloader isolation has somehow never become popular
P. Oscar Boykin
@johnynek
Sep 30 2016 00:27
it is just a lot of work.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:27
yeah, there is some pain because of monorepo
but to be honest it also makes things easier
Piyush Narang
@piyushnarang
Sep 30 2016 00:27
are we ranting about monorepos?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:27
for you.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:27
making changes to summingbird and getting then into source is also a big pain
Piyush Narang
@piyushnarang
Sep 30 2016 00:27
I feel the pain too :-)
P. Oscar Boykin
@johnynek
Sep 30 2016 00:27
you guys block our changes all the time due to anxiety about adopting them
:)
Pankaj Gupta
@pankajroark
Sep 30 2016 00:28
@johnynek only the ones that run into actual issues like the memory platform :)
P. Oscar Boykin
@johnynek
Sep 30 2016 00:28
so we are on like a 2 year old version of storm and kryo.
well, that is not it. Other changes are discussed and we get push-back.
anyway...
like we would like to make the changes to make serialization type-safe
Pankaj Gupta
@pankajroark
Sep 30 2016 00:30
I’m all for it actually, curios what kind of pushback you’ve got
P. Oscar Boykin
@johnynek
Sep 30 2016 00:30
but we are skeptical you would accept them since it would require a significant work on changing code on your end.
or the recent changes to algebird to interop with spire and cats.
or the long standing discussions about using a more recent storm.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:31
the delay with algebird change was unfortunate actually, Joe and Alex were out for a couple weeks
I don’t think we are opposed to that change
P. Oscar Boykin
@johnynek
Sep 30 2016 00:31
okay, that’s good.
Piyush Narang
@piyushnarang
Sep 30 2016 00:31
might require a bit of work like you mentioned though
P. Oscar Boykin
@johnynek
Sep 30 2016 00:32
that one should be almost no work.
the serialization improvements would be real work.
Piyush Narang
@piyushnarang
Sep 30 2016 00:32
ok cool
Pankaj Gupta
@pankajroark
Sep 30 2016 00:32
also sorry about the delay with recent storm, the delay in getting OSS Heron internally is very unfortunate but it’s a different team...
P. Oscar Boykin
@johnynek
Sep 30 2016 00:32
I understand… I know how it is.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:33
About serialization, I did have something I wanted to discuss
P. Oscar Boykin
@johnynek
Sep 30 2016 00:33
yes
Pankaj Gupta
@pankajroark
Sep 30 2016 00:33
have we considered something like flat buffers?
P. Oscar Boykin
@johnynek
Sep 30 2016 00:33
what do you mean exactly?
Pankaj Gupta
@pankajroark
Sep 30 2016 00:33
representation that doesn’t require deserialization
P. Oscar Boykin
@johnynek
Sep 30 2016 00:33
oh, well, that is hard to do, I think, on the JVM
Pankaj Gupta
@pankajroark
Sep 30 2016 00:33
it keeps data in the binary format
P. Oscar Boykin
@johnynek
Sep 30 2016 00:33
C/C++/Rust they can do things like this
Pankaj Gupta
@pankajroark
Sep 30 2016 00:33
there’s java implementation as well
basically they keep a byte array and offsets in there
it can potentially eliminate a lot of serialization/deserializaton overhead
but I don’t know how well it works in practice
it seems very promising for summingbird usecase because we typically only access internal data once I think
equals and hashcode operations will work really fast
we can even communicate with local caching processes easily
e.g. run memcache locally and avoid java’s gc etc
P. Oscar Boykin
@johnynek
Sep 30 2016 00:36
I see. So it is like protocol buffers, but generates a flattened data structure
Pankaj Gupta
@pankajroark
Sep 30 2016 00:36
yes
P. Oscar Boykin
@johnynek
Sep 30 2016 00:36
I mean, we could already do this now without summingbird changes.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:37
yeah, I think so
we use thrift so heavily though that we’ll need to write a converter from thrift schema
P. Oscar Boykin
@johnynek
Sep 30 2016 00:37
just use flatbuffer types, a custom store implementation, and kryo support for flatbuffers.
that might not be too hard.
what might be really cool is to have a scala macro that could generate a flatbuffer builder from java.
so you could just use a case class
Pankaj Gupta
@pankajroark
Sep 30 2016 00:38
yeah, that would be really great
I’ve no idea how much work that would be
P. Oscar Boykin
@johnynek
Sep 30 2016 00:38
probably somewhat similar to the work for type-descriptor
which is terribly named and would be better named “TypeFlattener"
Pankaj Gupta
@pankajroark
Sep 30 2016 00:39
I see,
P. Oscar Boykin
@johnynek
Sep 30 2016 00:39
that is the macro that gives you type-safe tsv/csv support
which is also a form of a flattened serialization
Pankaj Gupta
@pankajroark
Sep 30 2016 00:40
so one thing is that we’ll need objects to be flatbuffer from the beginning
if we have java objects and we convert then to flatbuffer only when we send them between nodes then we will not benefit much
or if we had to convert them back to java objects
P. Oscar Boykin
@johnynek
Sep 30 2016 00:41
right.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:41
not familiar with how the type descriptor works but does it allow that?
so, folks will need to be able to create flatbuffer objects directly too
P. Oscar Boykin
@johnynek
Sep 30 2016 00:42
it is not about that, it is about looking at a type and testing to see if it can be flattened
I guess for this to work you would need a trait actually, because if you have to build a case class you might already have lost.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:42
yeah
we’ll basically have factory functions that generate these objects
P. Oscar Boykin
@johnynek
Sep 30 2016 00:43
so, you want the macro to generate the code to implement an abstract class/trait by just reading from a ByteBuffer.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:43
yes, pretty much
code generator basically
P. Oscar Boykin
@johnynek
Sep 30 2016 00:43
that is what a macro is.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:44
yeah, so that would be one part of work,
hooking it up with kryo should be simple I guess
P. Oscar Boykin
@johnynek
Sep 30 2016 00:44
yeah, that part would be really easy.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:44
so code generation is the important part
we can’t directly use the flat buffer code here if we use macros
we’ll need to basically translate that code over
P. Oscar Boykin
@johnynek
Sep 30 2016 00:45
maybe, maybe not. Maybe we can call their code in the macro/
Pankaj Gupta
@pankajroark
Sep 30 2016 00:46
hmm… yeah proabably, if we keep an internal object of their type
P. Oscar Boykin
@johnynek
Sep 30 2016 00:46
that said...
scalding already has a pretty nice work for binary serialization with macros
and other tricks
Pankaj Gupta
@pankajroark
Sep 30 2016 00:46
yeah, I heard about the Twadoop work
P. Oscar Boykin
@johnynek
Sep 30 2016 00:46
and I don’t think it was widely rolled out due to the perceived work.
so, we could make that required in scalding.
see how that can lower costs/speed up.
but there will be migration work.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:48
yeah, I think Alex has been doing the parquet macros work and stuff like that, he would have more context on that
P. Oscar Boykin
@johnynek
Sep 30 2016 00:48
really, Twitter needs to build scala refactoring tools if they are going to have a monorepo
Pankaj Gupta
@pankajroark
Sep 30 2016 00:49
scala is a hard language to refactor is my impression, I’m probably wrong :)
Piyush Narang
@piyushnarang
Sep 30 2016 00:50
a lot of the monorepo pains extend beyond refactoring too
P. Oscar Boykin
@johnynek
Sep 30 2016 00:50
I disagree! The type-safety can make that very safe
true
Pankaj Gupta
@pankajroark
Sep 30 2016 00:50
but there are so many features in scala, doesn’t that make it harder?
Piyush Narang
@piyushnarang
Sep 30 2016 00:50
releasing any large version upgrade is scary
Pankaj Gupta
@pankajroark
Sep 30 2016 00:51
plus git is so slow… There I said it
Piyush Narang
@piyushnarang
Sep 30 2016 00:51
haha
P. Oscar Boykin
@johnynek
Sep 30 2016 00:51
haha
Pankaj Gupta
@pankajroark
Sep 30 2016 00:52
@johnynek thanks for your input about serialization. I’m going to be off on paternity soon, this is a topic that interests me a lot.
I can probably do some work on my own and see if I can push it after I come back.
P. Oscar Boykin
@johnynek
Sep 30 2016 00:53
enjoy! Ian and I have been talking about making another push to improve and factor out scalding serialization
maybe we can really make it more generally useful.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:53
some jobs spend 40% time just serializing/deserializing
yeah that would be great
P. Oscar Boykin
@johnynek
Sep 30 2016 00:54
another thing we can do is reduce the number of stages in summingbird
so there is less serialization
Pankaj Gupta
@pankajroark
Sep 30 2016 00:54
@johnynek yes, the spout map side aggregation was really aimed at that
we need to do more of that
like do it not just at the spout but later in the pipeline as well
we need to solve the forking problem first
P. Oscar Boykin
@johnynek
Sep 30 2016 00:54
yeah. We can compose after sumByKey, which we don’t do now, I don’t think.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:55
yeah
P. Oscar Boykin
@johnynek
Sep 30 2016 00:55
we need to make more use of the graph optimization system here.
otherwise, this is really hard to code up.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:56
Timur Abhisev just joined our team, I’ll get him excited about these ideas :)
P. Oscar Boykin
@johnynek
Sep 30 2016 00:56
great!
Pankaj Gupta
@pankajroark
Sep 30 2016 00:56
we should probably choose the most important of these optimizations and start there
P. Oscar Boykin
@johnynek
Sep 30 2016 00:56
indeed
Pankaj Gupta
@pankajroark
Sep 30 2016 00:57
we need to solve the forking problem, right now we insert a node whenever there is a fork
P. Oscar Boykin
@johnynek
Sep 30 2016 00:57
yeah, that seems pretty solvable.
Pankaj Gupta
@pankajroark
Sep 30 2016 00:58
the downstream sumByKey merge problem is less critical because more than 90% topologies just do one sumByKey
P. Oscar Boykin
@johnynek
Sep 30 2016 00:58
wow. Okay.
but they fork?
Pankaj Gupta
@pankajroark
Sep 30 2016 00:58
yeah most jobs are really simple
P. Oscar Boykin
@johnynek
Sep 30 2016 00:58
how many fork?
Pankaj Gupta
@pankajroark
Sep 30 2016 00:58
yeah, they read from multiple sources
like some of them
e.g. tsar only has one sumByKey
P. Oscar Boykin
@johnynek
Sep 30 2016 00:58
I would call that merge, not fork
Pankaj Gupta
@pankajroark
Sep 30 2016 00:59
true
P. Oscar Boykin
@johnynek
Sep 30 2016 00:59
I thought tsar did several layers for different time ranges
Pankaj Gupta
@pankajroark
Sep 30 2016 00:59
may be, almost all topologies I’ve seen have single sumByKey
I’m still getting familiar with Tsar code, so don’t know for sure
but I’ve looked at the topology graphs
P. Oscar Boykin
@johnynek
Sep 30 2016 01:00
cool
Pankaj Gupta
@pankajroark
Sep 30 2016 01:00
the most common case is source -> flatmap -> summer
the second most common is multiple sources -> flatmap -> summer