Rob Norris
@tpolecat
Doesn't matter.
D Cameron Mauch
@DCameronMauch
There is the setting right there for maximum heap size
I just change that
Thanks for the pointer
Rob Norris
@tpolecat
In Scala 3 you can do it with Mirror but for Scala 2 you need Shapeless.
An awful lot of questions here (and elsewhere) are about things that are at the edge of what the language can do, and these are exactly the kinds of things that changed in Scala 3. So I think a lot of questions are going to have two answers for a while.
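A minimal Scala 3 sketch of the Mirror approach Rob mentions, for comparison (fieldNames is an illustrative helper, not standard library; it reads the constructor field names at compile time):

import scala.deriving.Mirror
import scala.compiletime.constValueTuple

// Collects a case class's constructor field names from its Mirror, no runtime reflection.
inline def fieldNames[T](using m: Mirror.ProductOf[T]): List[String] =
  constValueTuple[m.MirroredElemLabels].toList.map(_.toString)

case class User(id: Long, firstName: String, lastName: String)
// fieldNames[User] == List("id", "firstName", "lastName")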
D Cameron Mauch
@DCameronMauch
This seems to work:
implicit class DatasetOps[T: Encoder](ds: Dataset[T]) {
    def asCleaned[U: Encoder](): Dataset[U] = {
      ds.select(
        classOf[U]
          .getDeclaredFields
          .toList
          .map(_.getName)
          .map(col): _*
      ).as[U]
    }
}
Spark is weird. If you have a DataFrame with 10 columns, and convert it to a Dataset of some case class with 6 fields, those extra 4 columns are still there. Taking space, slowing down shuffles, etc. This is my attempt at removing all the crud.
D Cameron Mauch
@DCameronMauch
Ah, looks like the above doesn’t compile, though IntelliJ is not showing any errors
Rob Norris
@tpolecat
The declared fields are not necessarily the same thing as the primary constructor arguments.
Which I assume is what you meant by "fields"
D Cameron Mauch
@DCameronMauch
This compiles:
    def asCleaned[U: ClassTag: Encoder](): Dataset[U] = {
      val fields: List[Column] = implicitly[ClassTag[U]].runtimeClass.getDeclaredFields.toList.map(_.getName).map(col)
      ds.select(fields: _*).as[U]
    }
It seems to generate the expected list, except on Databricks, which adds some $outer fields to the end...
I’m not sure I understand the difference
I didn’t create some alternative apply
Spark seems to also get this list of fields, and maps each column with the right name to a field in order to construct the class instance
Rob Norris
@tpolecat
This may work for your specific case but it is very fragile in general.
D Cameron Mauch
@DCameronMauch
So best to stick with the Shapeless solution?
Rob Norris
@tpolecat
I think that would probably be safer.
@ implicitly[ClassTag[String]].runtimeClass.getDeclaredFields.toList.map(_.getName) 
res1: List[String] = List(
  "value",
  "coder",
  "hash",
  "serialVersionUID",
  "COMPACT_STRINGS",
  "serialPersistentFields",
  "CASE_INSENSITIVE_ORDER",
  "LATIN1",
  "UTF16"
)
Those are certainly not the names of the fields of the string constructor.
D Cameron Mauch
@DCameronMauch
Oy, dang
Okay, Shapeless it is
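A minimal Scala 2 sketch of what the Shapeless route might look like (assuming shapeless 2.3.x on the classpath; FieldNames is an illustrative name): LabelledGeneric plus Keys yields the constructor field names at compile time, unlike getDeclaredFields.

import shapeless._
import shapeless.ops.hlist.ToTraversable
import shapeless.ops.record.Keys

// Typeclass carrying a case class's constructor field names, derived at compile time.
trait FieldNames[T] { def apply(): List[String] }

object FieldNames {
  implicit def derive[T, Repr <: HList, KeysRepr <: HList](implicit
      gen: LabelledGeneric.Aux[T, Repr],            // labelled record representation of T
      keys: Keys.Aux[Repr, KeysRepr],               // its field keys, as an HList of Symbols
      toList: ToTraversable.Aux[KeysRepr, List, Symbol]
  ): FieldNames[T] = new FieldNames[T] {
    def apply(): List[String] = keys().toList.map(_.name)
  }
}

case class User(id: Long, firstName: String, lastName: String)
// implicitly[FieldNames[User]].apply() == List("id", "firstName", "lastName")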
Eric K Richardson
@ekrich
How many lines of code does it take to just select the fields you want and map them into case classes? Or do you have so much of it that getting rid of that code is important?
D Cameron Mauch
@DCameronMauch
I was trying to come up with a generic solution, such that a developer could take any case class T and do something like df.as[T], without having to maintain something like a companion object with the list of fields. Though that would be much more straightforward.
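One hedged way to get that generic behaviour, reusing the FieldNames typeclass sketched above (asPruned and DatasetPruneOps are illustrative names, not Spark API): select only the case class's constructor fields, then convert, so the extra columns are gone before any shuffle.

import org.apache.spark.sql.{Column, Dataset, Encoder}
import org.apache.spark.sql.functions.col

implicit class DatasetPruneOps[T](ds: Dataset[T]) {
  // Project down to U's constructor fields before the .as[U] conversion.
  def asPruned[U: Encoder: FieldNames]: Dataset[U] = {
    val cols: List[Column] = implicitly[FieldNames[U]].apply().map(col)
    ds.select(cols: _*).as[U]
  }
}

// usage (hypothetical): val users: Dataset[User] = df.asPruned[User]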
Alessandro
@ImGrayMouser_twitter

Hi everyone,
I was playing with a coding challenge. Basically I need to remove duplicates.
My first implementation used a mutable Array (because the given method signature was providing and expecting Array).
To make it short, it all boils down to the following. Given

val a1 = Array(1,2,3)
val a2 = Array(1,2,3)

val l1 = List(1,2,3)
val l2 = List(1,2,3)

scala> a1 == a2
res153: Boolean = false

scala> l1 == l2
res154: Boolean = true

Consequently, this happens:

scala> val s1 = Set(a1,a2)
s1: scala.collection.immutable.Set[Array[Int]] = Set(Array(1, 2, 3), Array(1, 2, 3))

scala> val s2 = Set(l1,l2)
s2: scala.collection.immutable.Set[List[Int]] = Set(List(1, 2, 3))

Why do Arrays with the same elements not compare as equal, and conversely, why do Lists?

Is there a way to have Set work as expected with Arrays as well?
Thanks

Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch wow, really? Maybe it would be good to open an issue in Spark; I am pretty sure that isn't the intended behaviour.
@ImGrayMouser_twitter because Arrays are not real collections; they are JVM primitives.
And you shouldn't use them, especially when learning. They are only useful for performance-sensitive code.
They are mutable, they are invariant, and they do not have a pretty toString nor a sensible equals.
Let me guess, you are trying to solve some LeetCode exercises?
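A small illustration of the difference and two ways around it (Scala 2.13 assumed): arrays get Java's reference equality, so either compare with sameElements or wrap them in a real immutable collection before putting them into a Set.

val a1 = Array(1, 2, 3)
val a2 = Array(1, 2, 3)

a1 == a2              // false: Array inherits reference equality from the JVM
a1.sameElements(a2)   // true: element-wise comparison

// Wrapping in an immutable collection gives structural equality, so Set dedupes:
Set(a1.toVector, a2.toVector)   // Set(Vector(1, 2, 3)) -> one element
Set(a1.toSeq, a2.toSeq)         // Set(ArraySeq(1, 2, 3)) on 2.13 -> one element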
D Cameron Mauch
@DCameronMauch
People here have said before that a Dataset is more like a view/projection kind of thing. The underlying data structure is still there, no matter how you view it.
The extra fields get cleaned up with some kind of map operation. But we have lots of cases where we load DataFrames, convert to Datasets, join, etc...
Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch I mean I would expect that df.as[T] would do some kind of select to prune unnecessary data.
D Cameron Mauch
@DCameronMauch
The huge issue is that we are actually only using around 10 columns from data with 800 columns, and when we join, we shuffle everything, including the extra 790 unused columns.
I was very shocked myself when I found out it does not do that
case class User(id: Long, firstName: String, lastName: String)

val df1: DataFrame = Seq(
  (1L, "John", "Doe", 37, "male"),
  (2L, "Jane", "Doe", 22, "female")
).toDF("id", "firstName", "lastName", "age", "gender")

val ds1: Dataset[User] = df1.as[User]

val df2: DataFrame = ds1.toDF

df2.printSchema
df2.show(2, false)
Yields:
root
 |-- id: long (nullable = false)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- gender: string (nullable = true)

+---+---------+--------+---+------+
|id |firstName|lastName|age|gender|
+---+---------+--------+---+------+
|1  |John     |Doe     |37 |male  |
|2  |Jane     |Doe     |22 |female|
+---+---------+--------+---+------+
Luis Miguel Mejía Suárez
@BalmungSan
:0
D Cameron Mauch
@DCameronMauch
Yeah… :-(
I was thinking I could maybe just do df.as[T].map(identity)
Or maybe df.as[T].map(_.copy())
Some invocation of map that returns the same object type, and thus clears out the cruft
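A hedged sketch of that idea, reusing the User/df1 example above and assuming spark.implicits._ is in scope: a typed map makes Spark run the rows back through the Encoder, so the re-encoded result should carry only the case class's columns (worth verifying with printSchema on your Spark version).

// identity keeps the same User objects, but the Dataset is re-encoded,
// which drops the untyped extra columns (age, gender) from the schema.
val pruned = df1.as[User].map(identity)

pruned.printSchema    // expected: only id, firstName, lastName
pruned.toDF.columns   // expected: Array(id, firstName, lastName)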
Eric K Richardson
@ekrich
I have dealt with this too and it can be a big problem as you say.
Spark can be pretty dicey.
D Cameron Mauch
@DCameronMauch
Aside from the shuffle, if the code looks like spark.read.parquet("…").select(…).as[T], then it can avoid even reading the unused columns in the first place
I think Spark is not smart enough to do that if I .map after the .as[T]
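A sketch of the read-side version (path and SparkSession setup assumed, with import spark.implicits._ in scope): projecting right after the read lets Spark push the column selection down to the Parquet scan, so the unused columns are never read at all.

val users: Dataset[User] =
  spark.read
    .parquet("/path/to/users")                 // illustrative path
    .select("id", "firstName", "lastName")     // projection pushed down to the Parquet reader
    .as[User]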
Eric K Richardson
@ekrich
It is like SQL, where you need to specify the projection (columns); otherwise you get everything, including columns from any joins as well.