    raja sekar
    @rajasekarv
    Currently you have to use serde_traitobject::Arc for dynamic dispatch, as I have implemented it only for that. I will add other box types soon too, like std Arc and Box.
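(A minimal sketch of what this looks like on the user side; the `Task` trait and `Stage` struct are hypothetical stand-ins, and serde's derive feature is assumed to be enabled.)

```rust
use serde::{Deserialize, Serialize};
use serde_traitobject as st;

// Hypothetical trait standing in for the project's serializable trait
// objects; it extends the crate's own marker traits so it can be
// (de)serialized behind a trait object.
trait Task: st::Serialize + st::Deserialize {
    fn run(&self) -> u64;
}

// Deriving serde works here because st::Arc<dyn Task> itself implements
// Serialize/Deserialize.
#[derive(Serialize, Deserialize)]
struct Stage {
    // serde_traitobject::Arc (not std::sync::Arc) is the wrapper that
    // currently supports dynamic dispatch with serialization.
    task: st::Arc<dyn Task>,
}
```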
    Nacho Duart
    @iduartgomez
    just the fact that it's not necessary to pass around the RT generic bound anymore is a gain :D
    much more ergonomic this way
    raja sekar
    @rajasekarv
    Yeah
    And it allows for more flexibility, which wasn't possible previously. Like storing a mix of mappedrdd, parallel collection, flatmappedrdd, etc., in a collection
    I came across the blatant limitation of the previous design when I tried to do a sample ETL job using this framework. You can't just apply a bunch of transformations to an rdd and store the result back in the same variable in a loop, as it leads to an infinite cyclic type, which Rust doesn't support for now.
    raja sekar
    @rajasekarv
    So it is necessary to make RDD object safe so it can be passed around as a trait object
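(A minimal sketch of the problem and the fix, using hypothetical, heavily simplified stand-ins for the real RDD types.)

```rust
use std::sync::Arc;

// Hypothetical simplified RDD types for illustration only.
trait Rdd {
    type Item;
    fn compute(&self) -> Vec<Self::Item>;
}

struct ParallelCollection(Vec<i32>);

struct MappedRdd {
    prev: Arc<dyn Rdd<Item = i32>>,
    f: fn(i32) -> i32,
}

impl Rdd for ParallelCollection {
    type Item = i32;
    fn compute(&self) -> Vec<i32> {
        self.0.clone()
    }
}

impl Rdd for MappedRdd {
    type Item = i32;
    fn compute(&self) -> Vec<i32> {
        self.prev.compute().into_iter().map(self.f).collect()
    }
}

fn main() {
    // With concrete generic types, `rdd = rdd.map(...)` in a loop changes the
    // type of `rdd` on every iteration (MappedRdd<MappedRdd<...>>), which the
    // compiler rejects as an infinitely recursive type. An object-safe trait
    // keeps the variable's type fixed across iterations:
    let mut rdd: Arc<dyn Rdd<Item = i32>> = Arc::new(ParallelCollection(vec![1, 2, 3]));
    for _ in 0..3 {
        rdd = Arc::new(MappedRdd { prev: rdd, f: |x| x + 1 });
    }
    assert_eq!(rdd.compute(), vec![4, 5, 6]);
}
```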
    Nacho Duart
    @iduartgomez
    Is it not possible to get rid of RddBase altogether now?
    raja sekar
    @rajasekarv
    As far as I know it's not possible, at least in current Rust. Existential types in Rust are not as flexible as Scala's. You have to store a bunch of completely different RDDs in a list while recursing the dependency tree to generate the DAG. After yesterday's update, we can now have Vec<Rdd<T>> where each Rdd can be a different concrete type like mappedrdd, sampledrdd, etc., as long as they all have the same type parameter T. But it is not possible to have Vec<Rdd<for any type>> in Rust yet. And this is needed when generating the DAG.
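(A hypothetical sketch of the distinction being made; the trait and method names are illustrative, not the project's actual API.)

```rust
use std::sync::Arc;

// Typed trait: fine for collections of different concrete RDDs that share
// the same item type, e.g. Vec<Arc<dyn Rdd<Item = i32>>>.
trait Rdd {
    type Item;
}

// Not expressible in Rust today: a trait object generic over "any Item",
// something like Vec<Arc<dyn for<T> Rdd<Item = T>>> (not valid syntax).

// So the DAG walk goes through a second, fully type-erased trait instead,
// which can hold RDDs of arbitrary item types in one collection:
trait RddBase {
    fn id(&self) -> usize;
    fn dependencies(&self) -> Vec<Arc<dyn RddBase>>;
}
```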
    Nacho Duart
    @iduartgomez
    Maybe when GATs land it will be possible somehow then.
    raja sekar
    @rajasekarv
    Yeah maybe
    raja sekar
    @rajasekarv
    @iduartgomez just wanted to check. Did you try unionRdd with the new API?
    Nacho Duart
    @iduartgomez
    yes, I updated it on my branch, but it still has the same problems
    I still have to debug them to see if I can find the cause
    Nacho Duart
    @iduartgomez
    btw now we have the rdd.rs module inside the rdd folder, which is a bit of an anti-pattern; could we move everything in there to rdd/mod.rs, or is there any reason for it?
    raja sekar
    @rajasekarv
    Oh ok. Let me also have a look at it
    Nacho Duart
    @iduartgomez
    https://github.com/iduartgomez/native_spark/tree/dev has the latest changes and parity with the master branch
    raja sekar
    @rajasekarv
    I wasn't aware that it is an anti-pattern. I thought that, even for those who are not familiar with Rust, it would be easy to know where to look for the RDD trait definition.
    Nacho Duart
    @iduartgomez
    ok, maybe we can make an exception for this; will add an attribute to ignore the lint in the future
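(Presumably the lint in question is clippy's module_inception; a sketch of what the exception could look like in rdd/mod.rs.)

```rust
// rdd/mod.rs (sketch): keep the trait definition in rdd/rdd.rs while
// silencing clippy's warning about a module named after its parent.
#[allow(clippy::module_inception)]
mod rdd;

pub use self::rdd::*;
```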
    raja sekar
    @rajasekarv
    Yeah that sounds good
    I am going to write an update blog post. What do you think should be included in it? I have prepared the roadmap, a draft, and a small demo of the dataframe design. I'll also raise some issues regarding security in closure serialization and REPL support, and put out a call for help with HDFS support.
    In addition, how does Photon sound as the project name?
    Nacho Duart
    @iduartgomez
    I like Photon :) Re. the blog, sounds good, though it seems a lot to cover in one post; but if we wait a bit and fix/add any remaining issues, you could announce in the blog that we have almost reached parity regarding core RDD functionality, which would show the project is under active development and has made decent progress in a short time
    raja sekar
    @rajasekarv
    Yeah that's ok. Just started writing. Planning to release it in 2 parts. Not going to release it within 2 weeks.
    After I update the scheduler and block manager with compression, we should be pretty good to go for the extensive testing and benchmarking phase.
    Nacho Duart
    @iduartgomez
    the next big step to improve performance will probably be making the network stack async, as well as making all task submission and execution asynchronous
    if we make a good benchmark suite we will be able to measure the impact, but for the most part I believe we will be IO bound most of the time, and in any computation-heavy part Rust is likely to beat Scala easily
    although that depends a lot on what the client of the library does inside the RDDs, as writing performant code in a language like Rust is harder for a novice with the language than in a GC language
    luckily for us Rust's async implementation is first-class and there are plenty of good network libraries to make this very efficient too
    raja sekar
    @rajasekarv
    Yes, it is indeed true that users need to put more thought into the code than in GC languages. For example, doing things like calling .to_string().len() on a value is pretty common among Python/Scala developers. This will hurt a lot in Rust; a large number of small allocations is worse in non-GC languages. However, this is not that common in a Spark setting, where workloads mostly perform number crunching and aggregation.
    And when dataframes are done here, we can have much more optimized functions available to users out of the box, like stack-allocated strings for smaller strings, etc.
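(A small hypothetical digit-counting example of the kind of allocation cost being described; not code from the project.)

```rust
fn count_digits_alloc(n: u64) -> usize {
    // Allocates a fresh String on every call; feels cheap in GC'd runtimes,
    // but is a real cost inside a hot Rust loop.
    n.to_string().len()
}

fn count_digits_no_alloc(mut n: u64) -> usize {
    // Same result with plain arithmetic and no heap allocation.
    let mut digits = 1;
    while n >= 10 {
        n /= 10;
        digits += 1;
    }
    digits
}

fn main() {
    assert_eq!(count_digits_alloc(12_345), 5);
    assert_eq!(count_digits_no_alloc(12_345), 5);
}
```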
    raja sekar
    @rajasekarv
    @iduartgomez I saw your commits. They look good. I consciously avoided adding Clap as a dependency; it has terrible compilation speed. Anyway, we already have syn, quote, and serde as dependencies, so it might not be that bad now.
    By any chance did you check the build times before and after?
    Fresh build and incremental build
    Nacho Duart
    @iduartgomez
    Didn't measure it, sorry. I added the stripped-down version, and with incremental compilation it wasn't really noticeable; it didn't pull in any extra heavyweight transitive dependencies. I will take a look today; if it's too bad I can change to a lighter command-line parser library.
    I don't think it's that bad to pull this in though, as it can be used to handle environment variables as well, and we will need a parser for utilities and binaries (like when we have our own version of submit or cluster deployment binaries).
    raja sekar
    @rajasekarv
    Yeah I think it's ok. I also didn't notice anything significant.
    We can just concentrate on adding features and look into this later.
    Nacho Duart
    @iduartgomez
    @rajasekarv I swapped lazy_init for once_cell to make error handling possible during initialization of config etc.
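(A minimal sketch of why once_cell helps here: get_or_try_init lets the one-time initializer return a Result instead of panicking. The Config type and env var name are hypothetical, not the project's actual code.)

```rust
use once_cell::sync::OnceCell;

#[derive(Debug)]
struct Config {
    master_host: String,
}

static CONFIG: OnceCell<Config> = OnceCell::new();

fn config() -> Result<&'static Config, std::env::VarError> {
    // The closure runs at most once; any error is returned to the caller
    // instead of aborting the process.
    CONFIG.get_or_try_init(|| {
        Ok(Config {
            master_host: std::env::var("NS_MASTER_HOST")?,
        })
    })
}

fn main() {
    match config() {
        Ok(cfg) => println!("config loaded: {:?}", cfg),
        Err(err) => eprintln!("failed to initialize config: {}", err),
    }
}
```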
    raja sekar
    @rajasekarv
    👍
    Anton Kochkov
    @XVilka
    In the future it would be awesome to have everything integrated together, like timberio/vector#988
    raja sekar
    @rajasekarv
    Yeah. There is already a request similar to this
    Anton Kochkov
    @XVilka
    Ah, sorry, I forgot I opened issues already
    raja sekar
    @rajasekarv
    Haha. No problem
    Nacho Duart
    @iduartgomez
    been kind of busy the last few days so I wasn't able to finish any of the ongoing changes I have pending
    hopefully I will be able to debug the issues with union and finish a couple of changes, then I'll be able to merge some stuff
    will probably need to add repartition/coalesce functionality to finish some of it
    raja sekar
    @rajasekarv
    No problem. I am in the middle of a relocation myself. However, I found out what the problem is in Union. Will let you know the details tomorrow.
    Nacho Duart
    @iduartgomez
    since you already found the problem I won't spend time trying to debug it; when you get a chance, update issue #41 with the details and I will add any changes to the branch :) meanwhile I will be adding the coalesce rdd since we need it for other stuff