Rob Norris
@tpolecat
Which I assume is what you meant by "fields"
D Cameron Mauch
@DCameronMauch
This compiles:
    def asCleaned[U: ClassTag: Encoder](): Dataset[U] = {
      // runtime reflection: collect U's declared field names and turn them into columns
      val fields: List[Column] =
        implicitly[ClassTag[U]].runtimeClass.getDeclaredFields.toList.map(_.getName).map(col)
      // keep only those columns, then view the result as a typed Dataset
      ds.select(fields: _*).as[U]
    }
It seems to generate the expected list, except in Databricks, which adds some $outer fields at the end...
I’m not sure I understand the difference
I didn’t create some alternative apply
Spark also seems to get this list of fields and map each column by name to a field when constructing the class instance
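A likely explanation for the $outer fields, sketched below (my assumption: notebook environments such as Databricks compile each cell inside a wrapper class, and a case class nested in another class carries a synthetic reference to its enclosing instance):
class Notebook {
  case class Point(x: Int, y: Int)  // nested, so it captures the enclosing instance
}

classOf[Notebook#Point].getDeclaredFields.toList.map(_.getName)
// includes a synthetic "$outer" alongside "x" and "y"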
Rob Norris
@tpolecat
This may work for your specific case but it is very fragile in general.
D Cameron Mauch
@DCameronMauch
So best to stick with the Shapeless solution?
Rob Norris
@tpolecat
I think that would probably be safer.
@ implicitly[ClassTag[String]].runtimeClass.getDeclaredFields.toList.map(_.getName) 
res1: List[String] = List(
  "value",
  "coder",
  "hash",
  "serialVersionUID",
  "COMPACT_STRINGS",
  "serialPersistentFields",
  "CASE_INSENSITIVE_ORDER",
  "LATIN1",
  "UTF16"
)
Those are certainly not the names of the fields of the String constructor.
D Cameron Mauch
@DCameronMauch
Oy, dang
Okay, Shapeless it is
Eric K Richardson
@ekrich
How many lines of code does it take to just select the fields you want and map them into case classes? Or do you have so much of that code that getting rid of it is important?
D Cameron Mauch
@DCameronMauch
I was trying to come up with a generic solution, such that a developer could take any case class T and do something like df.as[T], without having to, say, keep a companion object with the list of fields. Though that would be much more straightforward.
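For reference, a minimal sketch of the Shapeless route (assuming Shapeless 2.x; the names here are mine): derive the field names from the case class at compile time, so synthetic runtime fields like $outer never appear.
import shapeless._
import shapeless.ops.hlist.ToTraversable
import shapeless.ops.record.Keys

class FieldNames[T] {
  def apply[Repr <: HList, K <: HList]()(implicit
    gen: LabelledGeneric.Aux[T, Repr],        // T viewed as a labelled record
    keys: Keys.Aux[Repr, K],                  // the record's keys as an HList of Symbols
    toList: ToTraversable.Aux[K, List, Symbol]
  ): List[String] = keys().toList.map(_.name)
}

object FieldNames {
  def apply[T]: FieldNames[T] = new FieldNames[T]
}

case class User(id: Long, firstName: String, lastName: String)
FieldNames[User].apply()  // List("id", "firstName", "lastName")
From there, ds.select(FieldNames[User].apply().map(col): _*).as[User] would give the same pruning as the reflective version, without the $outer surprise.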
Alessandro
@ImGrayMouser_twitter

Hi everyone,
I was playing with a coding challenge. Basically I need to remove duplicates.
My first implementation used a mutable Array (because the given method signature both took and returned Array).
To make it short, it all boils down to the following. Given

val a1 = Array(1,2,3)
val a2 = Array(1,2,3)

val l1 = List(1,2,3)
val l2 = List(1,2,3)

scala> a1 == a2
res153: Boolean = false

scala> l1 == l2
res154: Boolean = true

Consequently, this happens:

scala> val s1 = Set(a1,a2)
s1: scala.collection.immutable.Set[Array[Int]] = Set(Array(1, 2, 3), Array(1, 2, 3))

scala> val s2 = Set(l1,l2)
s2: scala.collection.immutable.Set[List[Int]] = Set(List(1, 2, 3))

Why do Arrays with the same elements not compare as equal, and conversely, why do Lists?

Is there a way to have Set work as expected with Arrays too?
Thanks

Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch wow, really? Maybe it would be good to open an issue in Spark, I am pretty sure that isn't the intended behaviour.
@ImGrayMouser_twitter because Arrays are not real collections; they are JVM primitives.
And you shouldn't use them, especially when learning. They are only useful for performance-sensitive code.
They are mutable, they are invariant, and they have neither a pretty toString nor a sensible equals.
Let me guess, you are trying to solve some LeetCode exercises?
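To make the workaround concrete, a quick sketch (assuming Scala 2.13):
import scala.collection.immutable.ArraySeq

val a1 = Array(1, 2, 3)
val a2 = Array(1, 2, 3)

a1 == a2            // false: arrays inherit reference equality from the JVM
a1 sameElements a2  // true: compares contents

// Wrap the arrays in a real collection with structural equality.
// ArraySeq.unsafeWrapArray avoids copying (but shares the mutable backing array):
Set(ArraySeq.unsafeWrapArray(a1), ArraySeq.unsafeWrapArray(a2)).size  // 1

// Or just convert:
Set(a1.toList, a2.toList).size  // 1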
D Cameron Mauch
@DCameronMauch
People here have said before that a Dataset is more like a view/projection kind of thing. The underlying data structure is still there, no matter how you view it.
The extra fields get cleaned up with some kind of map operation. But we have lots of cases where we load DataFrames, convert to Datasets, join, etc...
Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch I mean I would expect that df.as[T] would do some kind of select to prune unnecessary data.
D Cameron Mauch
@DCameronMauch
The huge issue is we are actually using like 10 columns from data with 800 columns, and when we join, we shuffle, including the extra 790 unused columns.
I was very shocked myself when I found out it does not do that
case class User(id: Long, firstName: String, lastName: String)

val df1: DataFrame = Seq(
  (1L, "John", "Doe", 37, "male"),
  (2L, "Jane", "Doe", 22, "female")
).toDF("id", "firstName", "lastName", "age", "gender")

val ds1: Dataset[User] = df1.as[User]

val df2: DataFrame = ds1.toDF

df2.printSchema
df2.show(2, false)
Yields:
root
 |-- id: long (nullable = false)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- gender: string (nullable = true)

+---+---------+--------+---+------+
|id |firstName|lastName|age|gender|
+---+---------+--------+---+------+
|1  |John     |Doe     |37 |male  |
|2  |Jane     |Doe     |22 |female|
+---+---------+--------+---+------+
Luis Miguel Mejía Suárez
@BalmungSan
:0
D Cameron Mauch
@DCameronMauch
Yeah… :-(
I was thinking I could maybe just do df.as[T].map(identity)
Or maybe df.as[T].map(_.copy())
Some invocation of map that returns the same object type, and thus clears out the cruft
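A sketch of that idea (hypothetical; it forces a full deserialize/re-serialize pass, and whether that cost is acceptable is a separate question), reusing the User example from above:
import org.apache.spark.sql.Dataset
import spark.implicits._  // assumes a SparkSession in scope as `spark`

// df1 and User as defined above
val pruned: Dataset[User] = df1.as[User].map(identity)  // round-trip through the User encoder

pruned.toDF.printSchema  // expect only id, firstName, lastName this time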
Eric K Richardson
@ekrich
I have dealt with this too and it can be a big problem as you say.
Spark can be pretty dicey.
D Cameron Mauch
@DCameronMauch
Aside from the shuffle, if the code looks like spark.read.parquet("…").select(…).as[T], then Spark can avoid even reading the unused columns in the first place
I think Spark is not smart enough to do that if I .map after the .as[T]
Eric K Richardson
@ekrich
It is like SQL, where you need to specify the projection (columns); otherwise you get everything, including columns from any joins as well.
D Cameron Mauch
@DCameronMauch
Oh yeah, I tested that too. All the extra columns from both joined DataFrames are present in the output of the join
I really like the efficiency of DataFrames, but like the type safety of Datasets
Eric K Richardson
@ekrich
Totally agree.
D Cameron Mauch
@DCameronMauch
I wish Frameless were a better-maintained project, as it supposedly gives the best of both worlds
As is, my team won’t accept it, because they think it’s half-baked and no one is maintaining it
Eric K Richardson
@ekrich
Actually, there isn’t supposed to be any real performance hit with Datasets.
D Cameron Mauch
@DCameronMauch
The problem is it can’t optimize lambdas.
If you create some Dataset of T and do something with it, it can only do that after it reifies the object.
So even if you only use a handful of the fields in the case class, Spark can’t load only those fields from the underlying storage.
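An illustration of that point (hypothetical schema and path): a typed lambda is opaque to Catalyst, while the equivalent Column expression stays in the plan and can be pushed down.
import org.apache.spark.sql.Dataset
import spark.implicits._  // assumes a SparkSession in scope as `spark`

case class Event(id: Long, kind: String)  // hypothetical

val events: Dataset[Event] = spark.read.parquet("events.parquet").as[Event]

// Opaque to the optimizer: every full Event is deserialized before the test runs.
events.filter(_.kind == "click")

// Visible to the optimizer: the predicate can be pushed down to the Parquet reader,
// and unused columns need not be read at all.
events.filter($"kind" === "click")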
Seth Tisue
@SethTisue
@ImGrayMouser_twitter seconding that: learn-you-a-Scala coding challenges should most definitely not be directing you to use Array
D Cameron Mauch
@DCameronMauch
The only place I would expect to see Array in a Scala application is in a main method, to satisfy the signature.
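For completeness, that one conventional appearance:
object Main {
  // the JVM entry point mandates this signature; the Array goes no further
  def main(args: Array[String]): Unit =
    println(args.mkString(" "))
}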