D Cameron Mauch
@DCameronMauch
People here have said before that a Dataset is more like a view/projection kind of thing. The underlying data structure is still there, no matter how you view it.
The extra fields get cleaned up with some kind of map operation. But we have lots of cases where we load DataFrames, convert to Datasets, join, etc...
Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch I mean I would expect that df.as[T] would do some kind of select to prune unnecessary data.
D Cameron Mauch
@DCameronMauch
The huge issue is we are actually using like 10 columns from data with 800 columns, and when we join, we shuffle, including the extra 790 unused columns.
I was very shocked myself when I found out it does not do that
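import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._ // for toDF and as[User]; assumes a SparkSession named spark

case class User(id: Long, firstName: String, lastName: String)

val df1: DataFrame = Seq(
  (1L, "John", "Doe", 37, "male"),
  (2L, "Jane", "Doe", 22, "female")
).toDF("id", "firstName", "lastName", "age", "gender")

// as[User] only changes the static type; the extra columns are kept
val ds1: Dataset[User] = df1.as[User]

val df2: DataFrame = ds1.toDF

df2.printSchema()
df2.show(2, false)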
Yields:
root
 |-- id: long (nullable = false)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- gender: string (nullable = true)

+---+---------+--------+---+------+
|id |firstName|lastName|age|gender|
+---+---------+--------+---+------+
|1  |John     |Doe     |37 |male  |
|2  |Jane     |Doe     |22 |female|
+---+---------+--------+---+------+
Luis Miguel Mejía Suárez
@BalmungSan
:0
D Cameron Mauch
@DCameronMauch
Yeah… :-(
I was thinking I could maybe just do df.as[T].map(identity)
Or maybe df.as[T].map(_.copy())
Some invocation of map that returns the same object type, and thus clears out the cruft
Eric K Richardson
@ekrich
I have dealt with this too and it can be a big problem as you say.
Spark can be pretty dicey.
D Cameron Mauch
@DCameronMauch
Aside from the shuffle, if the code looks like spark.read.parquet("…").select(…).as[T], then it can avoid even reading the unused columns in the first place
I think Spark is not smart enough to do that if I .map after the .as[T]
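A minimal sketch of the select-before-as idea from this thread, assuming Spark SQL is on the classpath and a local SparkSession is acceptable. The helper name `asPruned` is hypothetical and not part of Spark; it reads the field names from the encoder's schema and selects only those columns before calling `.as[T]`.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

case class User(id: Long, firstName: String, lastName: String)

object PruneExample {
  // Hypothetical helper (not a Spark API): keep only the columns that T
  // declares, then convert to a typed Dataset.
  def asPruned[T](df: DataFrame)(implicit enc: Encoder[T]): Dataset[T] =
    df.select(enc.schema.fieldNames.map(col): _*).as[T]

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prune-example")
      .getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq(
      (1L, "John", "Doe", 37, "male"),
      (2L, "Jane", "Doe", 22, "female")
    ).toDF("id", "firstName", "lastName", "age", "gender")

    val users: Dataset[User] = asPruned[User](df)

    // The schema now contains only id, firstName, lastName.
    users.toDF.printSchema()

    spark.stop()
  }
}
```

Because the projection is an ordinary `select` on column expressions, the optimizer can see it, unlike a `map` applied after `.as[T]`.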
Eric K Richardson
@ekrich
It is like SQL, where you need to specify the projection (columns); otherwise you get all of them, including the ones from any joins.
D Cameron Mauch
@DCameronMauch
Oh yeah, I tested that too. All the extra columns from both joined df’s are present in the output of the join
I really like the efficiency of DataFrames, but like the type safety of Datasets
Eric K Richardson
@ekrich
Totally agree.
D Cameron Mauch
@DCameronMauch
I wish Frameless was a much better maintained project, as it supposedly gives the best of both worlds
As is, my team won’t accept it, because they think it’s half-baked, and no one is maintaining it
Eric K Richardson
@ekrich
Actually, there is not supposed to be any real performance hit with Datasets.
D Cameron Mauch
@DCameronMauch
The problem is it can’t optimize lambdas
If you create some dataset of T, and do something with it, it can only do that after it reifies the object.
So even if you only use a handful of the fields in the case class, Spark can’t load only those fields from whatever storage
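To illustrate the point about lambdas, here is a sketch that compares a typed lambda filter with an equivalent Column expression. It assumes Spark SQL is on the classpath; the `Person` class and its data are made up for illustration. Calling `explain()` on each result lets you compare the physical plans yourself.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(id: Long, name: String, age: Int)

object LambdaVsColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lambda-vs-column")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] =
      Seq(Person(1L, "John", 37), Person(2L, "Jane", 22)).toDS()

    // Opaque to the optimizer: the function is a black box, so each row
    // has to be turned into a Person object before it can run.
    val viaLambda: Dataset[Person] = people.filter(_.age > 30)

    // Visible to the optimizer: a Column expression it can analyze.
    val viaColumn: Dataset[Person] = people.filter($"age" > 30)

    viaLambda.explain()
    viaColumn.explain()

    spark.stop()
  }
}
```

Both filters return the same rows; the difference is only in how much of the query the optimizer can reason about.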
Seth Tisue
@SethTisue
@ImGrayMouser_twitter seconded that learn-you-a-Scala coding challenges should most definitely not be directing you to use Array
D Cameron Mauch
@DCameronMauch
The only place I would expect to see Array in a Scala application is in a main method, to satisfy the signature.
Luis Miguel Mejía Suárez
@BalmungSan
Not even there if you use IOApp from cats-effect :grimacing:
Also, I believe Scala 3 will have something similar to one of Li's libraries to automatically parse the arguments into a case class, so your main method would not see the Array
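For reference, Scala 3 has `@main` methods, which parse command-line arguments into typed parameters (plain parameters, not a case class), so user code never touches the `Array[String]`. A small sketch, with `greet` and `greeting` as made-up names:

```scala
// Scala 3 only: the compiler generates the Array[String] entry point
// and converts the arguments to the declared parameter types.
@main def greet(name: String, times: Int): Unit =
  println(greeting(name, times))

// Pure helper so the logic can be checked without running the program.
def greeting(name: String, times: Int): String =
  List.fill(times)(s"Hello, $name!").mkString(" ")
```

Running the generated `greet` program with the arguments `Ada 2` calls `greet("Ada", 2)`.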
Matt Hicks
@darkfrog26
Ugh...80% of my open-source projects now released for Scala 3.
Rob Norris
@tpolecat
:tada:
I'm waiting on a few things but I'm almost done.
Eric K Richardson
@ekrich
Nice job!
Life without macros is easier.
Matt Hicks
@darkfrog26
@ekrich sometimes yes, sometimes no
Eric K Richardson
@ekrich
At least for porting, not for what you are doing :smile:
Seth Tisue
@SethTisue
Luis Miguel Mejía Suárez
@BalmungSan
:tada:
gitleet
@gitleet
I'm confused, why does typesafe config return back java collections?
Luis Miguel Mejía Suárez
@BalmungSan
Well, mostly because it is a Java library.
gitleet
@gitleet
:) ok the typesafe namespace got me confused
Luis Miguel Mejía Suárez
@BalmungSan
The good news is that transforming a Java collection into a Scala one is pretty easy thanks to the CollectionConverters in the stdlib :)
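A small sketch of that conversion, assuming Scala 2.13 or later (where `scala.jdk.CollectionConverters` lives). The `javaHosts` and `javaPorts` methods are stand-ins for whatever a Java API returns; they are not calls into any config library.

```scala
import java.util.{ArrayList => JArrayList, HashMap => JHashMap}
import scala.jdk.CollectionConverters._

object ConvertersExample {
  // Stand-in for a Java API that returns a java.util.List.
  def javaHosts(): java.util.List[String] = {
    val hosts = new JArrayList[String]()
    hosts.add("alpha")
    hosts.add("beta")
    hosts
  }

  // Stand-in for a Java API that returns a java.util.Map.
  def javaPorts(): java.util.Map[String, Integer] = {
    val ports = new JHashMap[String, Integer]()
    ports.put("alpha", 8080)
    ports
  }

  def main(args: Array[String]): Unit = {
    // asScala wraps the Java collection; toList / toMap copy it into
    // immutable Scala collections.
    val hosts: List[String] = javaHosts().asScala.toList
    val ports: Map[String, Int] =
      javaPorts().asScala.map { case (k, v) => k -> v.intValue }.toMap

    println(hosts)
    println(ports)
  }
}
```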
Matt Hicks
@darkfrog26
@gitleet typesafe-config is pure evil. :-p
Luis Miguel Mejía Suárez
@BalmungSan
You may also want to take a look at PureConfig, which is a Scala wrapper over typesafe config.
Matt Hicks
@darkfrog26
I'd recommend Profig or PureConfig (the former being better because I wrote it). :-p
Oh, and Profig doesn't rely on typesafe-config
I'm honestly surprised that typesafe-config is still so widely used
Rob Norris
@tpolecat
I have had pretty good luck with Ciris.