D Cameron Mauch
@DCameronMauch
I was trying to come up with a generic solution, such that a developer could take any case class T and call df.as[T], without needing something like a companion object that lists the fields. Though that would be much more straightforward.
Alessandro
@ImGrayMouser_twitter

Hi everyone,
I was playing with a coding challenge. Basically, I need to remove duplicates.
My first implementation used a mutable Array (because the given method signature both takes and returns an Array).
To make it short, it all boils down to the following. Given

val a1 = Array(1,2,3)
val a2 = Array(1,2,3)

val l1 = List(1,2,3)
val l2 = List(1,2,3)

scala> a1 == a2
res153: Boolean = false

scala> l1 == l2
res154: Boolean = true

Consequently, this happens:

scala> val s1 = Set(a1,a2)
s1: scala.collection.immutable.Set[Array[Int]] = Set(Array(1, 2, 3), Array(1, 2, 3))

scala> val s2 = Set(l1,l2)
s2: scala.collection.immutable.Set[List[Int]] = Set(List(1, 2, 3))

Why do Arrays with the same elements not compare as equal or, conversely, why do Lists?

Is there a way to make Set work as expected with Arrays too?
Thanks

Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch wow, really? Maybe it would be good to open an issue in Spark, I am pretty sure that isn't the intended behaviour.
@ImGrayMouser_twitter because Arrays are not real collections; they are JVM primitives.
And you shouldn't use them, especially when learning. They are only useful for performance-sensitive code.
They are mutable, they are invariant, and they have neither a pretty toString nor a sensible equals.
Let me guess, you are trying to solve some LeetCode exercises?
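For reference, a minimal sketch of comparing and de-duplicating arrays by content rather than by reference. It assumes plain Scala with no extra libraries; the helper name distinctByContent is made up for illustration.

```scala
// Arrays compare by reference; Lists compare by content.
val a1 = Array(1, 2, 3)
val a2 = Array(1, 2, 3)

val sameReference = a1 == a2                        // reference comparison of two distinct objects
val sameContent   = a1.sameElements(a2)             // element-by-element comparison
val sameViaJava   = java.util.Arrays.equals(a1, a2) // the Java-side equivalent

// A Set of Lists de-duplicates by content, because List has structural equals and hashCode.
val asLists: Set[List[Int]] = Set(a1.toList, a2.toList)

// Hypothetical helper: de-duplicate arrays by content and hand arrays back to the caller.
def distinctByContent(arrays: Seq[Array[Int]]): Seq[Array[Int]] =
  arrays.map(_.toList).distinct.map(_.toArray)
```

Converting at the boundary like this keeps the Array-based method signature the challenge requires while doing the actual work on a collection with sensible equality.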
D Cameron Mauch
@DCameronMauch
People here have said before that a Dataset is more like a view/projection kind of thing. The underlying data structure is still there, no matter how you view it.
The extra fields get cleaned up with some kind of map operation. But we have lots of cases where we load DataFrames, convert to Datasets, join, etc...
Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch I mean I would expect that df.as[T] would do some kind of select to prune unnecessary data.
D Cameron Mauch
@DCameronMauch
The huge issue is we are actually using like 10 columns from data with 800 columns, and when we join, we shuffle, including the extra 790 unused columns.
I was very shocked myself when I found out it does not do that
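import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._ // assumes a SparkSession named spark is in scope

case class User(id: Long, firstName: String, lastName: String)

val df1: DataFrame = Seq(
  (1L, "John", "Doe", 37, "male"),
  (2L, "Jane", "Doe", 22, "female")
).toDF("id", "firstName", "lastName", "age", "gender")

// .as[User] only changes the static type; the age and gender columns survive
val ds1: Dataset[User] = df1.as[User]

val df2: DataFrame = ds1.toDF

df2.printSchema
df2.show(2, false)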
Yields:
root
 |-- id: long (nullable = false)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- gender: string (nullable = true)

+---+---------+--------+---+------+
|id |firstName|lastName|age|gender|
+---+---------+--------+---+------+
|1  |John     |Doe     |37 |male  |
|2  |Jane     |Doe     |22 |female|
+---+---------+--------+---+------+
Luis Miguel Mejía Suárez
@BalmungSan
:0
D Cameron Mauch
@DCameronMauch
Yeah… :-(
I was thinking I could maybe just do df.as[T].map(identity)
Or maybe df.as[T].map(_.copy())
Some invocation of map that returns the same object type, and thus clears out the cruft
Eric K Richardson
@ekrich
I have dealt with this too and it can be a big problem as you say.
Spark can be pretty dicey.
D Cameron Mauch
@DCameronMauch
Aside from the shuffle, if the code looks like spark.read.parquet("…").select(…).as[T], then it can avoid even reading the unused columns in the first place.
I think Spark is not smart enough to do that if I .map after the .as[T]
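One way the generic "select, then as[T]" idea might be sketched is to take the column names from T's encoder schema. This is only an illustration under assumptions: asPruned is a made-up name, T is taken to be a flat case class whose field names match the DataFrame's column names exactly, and nested or renamed fields are not handled.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}
import org.apache.spark.sql.functions.col

// Hypothetical helper: keep only the columns named in T's encoder schema, then convert.
// Assumes a flat T whose field names match the DataFrame's column names.
def asPruned[T: Encoder](df: DataFrame): Dataset[T] = {
  val wanted = implicitly[Encoder[T]].schema.fieldNames.toSeq.map(col)
  // The select comes before .as[T], so the projection is an ordinary column
  // expression that the optimizer can see, unlike a map after .as[T].
  df.select(wanted: _*).as[T]
}
```

With import spark.implicits._ in scope, asPruned[User](df1) on the example above should leave only id, firstName and lastName in the schema.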
Eric K Richardson
@ekrich
It is like SQL, where you need to specify the projection (columns); otherwise you get everything, and from any joins as well.
D Cameron Mauch
@DCameronMauch
Oh yeah, I tested that too. All the extra columns from both joined df’s are present in the output of the join
I really like the efficiency of DataFrames, but like the type safety of Datasets
Eric K Richardson
@ekrich
Totally agree.
D Cameron Mauch
@DCameronMauch
I wish Frameless was a much better maintained project, as it supposedly gives the best of both worlds
As is, my team won’t accept it, because they think it’s half-baked, and no one is maintaining it
Eric K Richardson
@ekrich
Actually, there is not supposed to be any performance hit really with Datasets.
D Cameron Mauch
@DCameronMauch
The problem is it can’t optimize lambdas
If you create some dataset of T, and do something with it, it can only do that after it reifies the object.
So even if you only use a handful of the fields in the case class, Spark can’t load only those fields from whatever storage
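To make the lambda point concrete, here is a small sketch contrasting a typed lambda filter with a Column-based filter on the same Dataset. Both return Dataset[User]; the function names are made up for illustration, and User mirrors the case class from the earlier example.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

case class User(id: Long, firstName: String, lastName: String)

// Typed lambda: opaque to the optimizer, so Spark has to build a User
// object for each row before it can call the function.
def filterWithLambda(users: Dataset[User]): Dataset[User] =
  users.filter(u => u.id > 1L)

// Column expression: the predicate is visible to the optimizer as an
// expression on the id column, and the result type is still Dataset[User].
def filterWithColumn(users: Dataset[User]): Dataset[User] =
  users.filter(col("id") > 1L)
```

Calling explain() on each result is a quick way to compare the two plans. The trade-off is that the Column version refers to the field by a string name, so a typo there is only caught at runtime.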
Seth Tisue
@SethTisue
@ImGrayMouser_twitter seconded that learn-you-a-Scala coding challenges should most definitely not be directing you to use Array
D Cameron Mauch
@DCameronMauch
The only place I would expect to see Array in a Scala application is in a main method, to satisfy the signature.
Luis Miguel Mejía Suárez
@BalmungSan
Not even there if you use IOApp from cats-effect :grimacing:
Also, I believe Scala 3 will have something similar to one of Li's libraries to automatically parse the arguments into a case class, so your main method would not see the Array
Matt Hicks
@darkfrog26
Ugh...80% of my open-source projects now released for Scala 3.
Rob Norris
@tpolecat
:tada:
I'm waiting on a few things but I'm almost done.
Eric K Richardson
@ekrich
Nice job!
Life without macros is easier.
Matt Hicks
@darkfrog26
@ekrich sometimes yes, sometimes no
Eric K Richardson
@ekrich
At least for porting, not for what you are doing :smile:
Seth Tisue
@SethTisue
Luis Miguel Mejía Suárez
@BalmungSan
:tada:
gitleet
@gitleet
I'm confused, why does Typesafe Config return Java collections?
Luis Miguel Mejía Suárez
@BalmungSan
Well, mostly because it is a Java library.
gitleet
@gitleet
:) ok the typesafe namespace got me confused
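For anyone hitting the same thing, a minimal sketch of converting a Java collection at the call site with the standard library converters. The fromJava value is a stand-in for whatever a Java API hands back (for example the java.util.List[String] from Config.getStringList); the names here are made up for illustration.

```scala
import scala.jdk.CollectionConverters._ // Scala 2.13+; on 2.12 use scala.collection.JavaConverters._

// Stand-in for a value returned by a Java API.
val fromJava: java.util.List[String] = java.util.Arrays.asList("a", "b", "c")

// asScala wraps the Java list without copying; toList then copies it
// into an immutable Scala List.
val asScalaList: List[String] = fromJava.asScala.toList
```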