Jim Newton
@jimka2001
Sorry, but I don't know where to put -Xmx8g.
Rob Norris
@tpolecat
On the command line this is what you do. I don't know how to tell IntelliJ to do it. My guess is that it's in the run configuration for your tests.
Someone here who uses IJ can tell you.
D Cameron Mauch
@DCameronMauch
I’m trying to figure out how to create a method with this signature: def fields[T](): List[String] where the output is a list of the field names in a case class T
Rob Norris
@tpolecat
You can do that with shapeless.
D Cameron Mauch
@DCameronMauch
@jimka2001 “IntelliJ IDEA” -> “preferences” -> “build, execute, deploy” -> “compiler” -> “scala compiler” -> “scala compile server”
Jim Newton
@jimka2001
I can add -Xmx8g to the VM options? beginning? end? doesn't matter?
Rob Norris
@tpolecat
Doesn't matter.
D Cameron Mauch
@DCameronMauch
There is the setting right there for maximum heap size
I just change that
Thanks for the pointer
Rob Norris
@tpolecat
In Scala 3 you can do it with Mirror but for Scala 2 you need Shapeless.
An awful lot of questions here (and elsewhere) are about things that are at the edge of what the language can do, and these are exactly the kinds of things that changed in Scala 3. So I think a lot of questions are going to have two answers for a while.
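For reference, a minimal sketch of the Scala 2 / Shapeless approach (assuming shapeless 2.3.x on the classpath; FieldNames and fields are illustrative names, not a library API):

import shapeless._
import shapeless.ops.hlist.ToTraversable
import shapeless.ops.record.Keys

// Hypothetical type class exposing the field names of a case class T.
trait FieldNames[T] {
  def apply(): List[String]
}

object FieldNames {
  implicit def derive[T, Repr <: HList, K <: HList](implicit
      gen: LabelledGeneric.Aux[T, Repr],         // maps T to a labelled record
      keys: Keys.Aux[Repr, K],                   // extracts the record keys (the field names)
      toList: ToTraversable.Aux[K, List, Symbol]
  ): FieldNames[T] =
    new FieldNames[T] {
      def apply(): List[String] = keys().toList.map(_.name)
    }
}

// The signature asked about earlier, driven by the type class:
def fields[T]()(implicit fn: FieldNames[T]): List[String] = fn()

case class User(id: Long, firstName: String, lastName: String)
// fields[User]() should yield List("id", "firstName", "lastName")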
D Cameron Mauch
@DCameronMauch
This seems to work:
implicit class DatasetOps[T: Encoder](ds: Dataset[T]) {
  def asCleaned[U: Encoder](): Dataset[U] = {
    ds.select(
      classOf[U]
        .getDeclaredFields
        .toList
        .map(_.getName)
        .map(col): _*
    ).as[U]
  }
}
Spark is weird. If you have a DataFrame with 10 columns and convert it to a Dataset of some case class with 6 fields, those extra 4 columns are still there. Taking space, slowing down shuffles, etc. This is my attempt at removing all the crud.
D Cameron Mauch
@DCameronMauch
Ah, looks like the above doesn’t compile, though IntelliJ is not showing any errors
Rob Norris
@tpolecat
The declared fields are not necessarily the same thing as the primary constructor arguments.
Which I assume is what you meant by "fields"
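A small illustration of the difference (hypothetical Person class, not from the chat above):

// Minimal sketch: body members become JVM fields too, so getDeclaredFields
// returns more than just the primary constructor parameters.
case class Person(name: String) {
  val shouted: String = name.toUpperCase // a body val, not a constructor parameter
}

classOf[Person].getDeclaredFields.toList.map(_.getName)
// expected to include both "name" and "shouted"; defining the class inside another
// class or a notebook cell can also add synthetic fields such as $outer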
D Cameron Mauch
@DCameronMauch
This compiles:
    def asCleaned[U: ClassTag: Encoder](): Dataset[U] = {
      val fields: List[Column] = implicitly[ClassTag[U]].runtimeClass.getDeclaredFields.toList.map(_.getName).map(col)
      ds.select(fields: _*).as[U]
    }
It seems to generate the expected list, except in Databricks, which adds some $outer fields to the end...
I’m not sure I understand the difference
I didn’t create some alternative apply
Spark seems to also get this list of fields, and map each column with the right name to a field to then construct the class instance
Rob Norris
@tpolecat
This may work for your specific case but it is very fragile in general.
D Cameron Mauch
@DCameronMauch
So best to stick with the Shapeless solution?
Rob Norris
@tpolecat
I think that would probably be safer.
@ implicitly[ClassTag[String]].runtimeClass.getDeclaredFields.toList.map(_.getName) 
res1: List[String] = List(
  "value",
  "coder",
  "hash",
  "serialVersionUID",
  "COMPACT_STRINGS",
  "serialPersistentFields",
  "CASE_INSENSITIVE_ORDER",
  "LATIN1",
  "UTF16"
)
Those are certainly not the names of the fields of the string constructor.
D Cameron Mauch
@DCameronMauch
Oy, dang
Okay, Shapeless it is
Eric K Richardson
@ekrich
How many lines of code does it take to just select the fields you want and map them into case classes? Or do you have so much of it that getting rid of that code is important?
D Cameron Mauch
@DCameronMauch
I was trying to come up with a generic solution, such that a developer could take any case class T and do something like df.as[T], without having to do something like keep a companion object with the list of fields. Though that would be much more straightforward.
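Putting those pieces together, a minimal sketch of such a helper, assuming Spark's Dataset/Encoder API and the FieldNames type class sketched earlier (CleanedDatasetOps is an illustrative name):

import org.apache.spark.sql.{Column, Dataset, Encoder}
import org.apache.spark.sql.functions.col

// Sketch: select only the case-class columns before converting, using the
// shapeless-derived FieldNames[U] instead of runtime reflection.
implicit class CleanedDatasetOps[T](private val ds: Dataset[T]) extends AnyVal {
  def asCleaned[U](implicit enc: Encoder[U], names: FieldNames[U]): Dataset[U] = {
    val cols: List[Column] = names().map(col)
    ds.select(cols: _*).as[U]
  }
}

// hypothetical usage: df.asCleaned[User] keeps only the id, firstName and lastName columns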
Alessandro
@ImGrayMouser_twitter

Hi everyone,
I was playing with a coding challenge. Basically I need to remove duplicates.
My first implementation used a mutable Array (because the given method signature was providing and expecting Array).
To make it short, it all boils down to the following. Given

val a1 = Array(1,2,3)
val a2 = Array(1,2,3)

val l1 = List(1,2,3)
val l2 = List(1,2,3)

scala> a1 == a2
res153: Boolean = false

scala> l1 == l2
res154: Boolean = true

Consequently, this happens:

scala> val s1 = Set(a1,a2)
s1: scala.collection.immutable.Set[Array[Int]] = Set(Array(1, 2, 3), Array(1, 2, 3))

scala> val s2 = Set(l1,l2)
s2: scala.collection.immutable.Set[List[Int]] = Set(List(1, 2, 3))

Why do Arrays with the same elements not compare as equal while, conversely, Lists do?

Is there a way to have Set work as expected with Arrays too?
Thanks

Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch wow, really? Maybe it would be good to open an issue in Spark; I am pretty sure that isn't the intended behaviour.
@ImGrayMouser_twitter because Arrays are not real collections; they are JVM primitives.
And you shouldn't use them, especially when learning. They are only useful for performance-sensitive code.
They are mutable, they are invariant, and they have neither a pretty toString nor a sensible equals.
Let me guess, you are trying to solve some LeetCode exercises?
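A minimal sketch of the usual workarounds: compare arrays element-wise, or convert them to an immutable Seq before building the Set:

val a1 = Array(1, 2, 3)
val a2 = Array(1, 2, 3)

a1 == a2             // false: Array inherits Java's reference equality
a1.sameElements(a2)  // true: element-wise comparison
a1.toSeq == a2.toSeq // true: Seqs use structural equality

val s = Set(a1.toSeq, a2.toSeq) // one element, as expected (the exact Seq type depends on the Scala version)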
D Cameron Mauch
@DCameronMauch
People here have said before that a Dataset is more like a view/projection kind of thing. The underlying data structure is still there, no matter how you view it.
The extra fields get cleaned up with some kind of map operation. But we have lots of cases where we load DataFrames, convert to Datasets, join, etc...
Luis Miguel Mejía Suárez
@BalmungSan
@DCameronMauch I mean I would expect that df.as[T] would do some kind of select to prune unnecessary data.
D Cameron Mauch
@DCameronMauch
The huge issue is we are actually using like 10 columns from data with 800 columns, and when we join, we shuffle, including the extra 790 unused columns.
I was very shocked myself when I found out it does not do that
case class User(id: Long, firstName: String, lastName: String)

val df1: DataFrame = Seq(
  (1L, "John", "Doe", 37, "male"),
  (2L, "Jane", "Doe", 22, "female")
).toDF("id", "firstName", "lastName", "age", "gender")

val ds1: Dataset[User] = df1.as[User]

val df2: DataFrame = ds1.toDF

df2.printSchema
df2.show(2, false)
Yields:
root
 |-- id: long (nullable = false)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- gender: string (nullable = true)

+---+---------+--------+---+------+
|id |firstName|lastName|age|gender|
+---+---------+--------+---+------+
|1  |John     |Doe     |37 |male  |
|2  |Jane     |Doe     |22 |female|
+---+---------+--------+---+------+
Luis Miguel Mejía Suárez
@BalmungSan
:0
D Cameron Mauch
@DCameronMauch
Yeah… :-(
I was thinking I could maybe just do df.as[T].map(identity)
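For what it's worth, a minimal sketch of that idea, reusing User and df1 from the snippet above:

// Sketch: .map(identity) deserializes each row into a User and re-encodes it,
// so only the case-class columns should survive the round trip.
val pruned = df1.as[User].map(identity)
pruned.printSchema
// expected: only id, firstName and lastName remain in the schema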