D Cameron Mauch
@DCameronMauch
But I don’t have access to our S3 stuff from Zeppelin
Yeah, I can give it a whirl
Just hard to give a presentation to my team on switching to this when I can’t get it to work in my notebook
The AWS EMR Spark is a custom build, right? They made some changes of some kind
marios iliofotou
@imarios
Yeah. I'm afraid frameless needs access to some internal APIs that are written differently in these custom versions of Spark (Databricks, EMR)
I thought I had it in our docs but now I see that we never added it there. Will try to fix the docs.
Ayoub
@I_am_ayoub_twitter
This seems new to me. We have been using frameless on EMR for at least two years without compatibility issues
marios iliofotou
@imarios
@I_am_ayoub_twitter that’s great to hear. It might just be Databricks then.
D Cameron Mauch
@DCameronMauch
@imarios Can you give me some details or point me to where these internal APIs are used? Maybe I can figure out how to get them working with Databricks and submit a PR.
I would have a really hard sell if I couldn’t show it working in Databricks
marios iliofotou
@imarios
@DCameronMauch let me look into this to find out those parts.
marios iliofotou
@imarios
Hello! I just created a PR (#479) to improve the syntax when working with Option columns in Frameless. If anyone wants to help with the code review please let me know.
It essentially tries to address the issues mentioned in #204
case class X(i: Option[Int], j: Option[String])
val t = TypedDataset.create(Seq(X(Some(1), None), X(None, None), X(Some(10), Some("foo"))))
t.select(t('i).opt.map(_ * 2), t('j).opt.map(_.startsWith("fo"))).show().run()
+----+----+
|  _1|  _2|
+----+----+
|   2|null|
|null|null|
|  20|true|
+----+----+
Mauricio Jost
@mauriciojost
Hello!
In my company we're trying to figure out a way to deal with schema evolution while using frameless.
Our BOMs are case classes that are semantically versioned (major: a field renamed or dropped somewhere; minor: new fields added; patch: no case class change, but something else changed). We're using frameless 0.8.0 / 0.7.0.
I was wondering what we can expect when reading a parquet with a case class that has fewer attributes (along its tree) than the case class used to write that parquet (both via frameless)? From a few simple tests, it seems to fail in some scenarios involving top-level attribute changes, but to work in some other scenarios with non-top-level attribute changes.
marios iliofotou
@imarios
Hey Mauricio, if they are missing fields, it’s better if those fields are optional.
Mauricio Jost
@mauriciojost

Hi @imarios , thanks, yes, we want to avoid nulls too. Just to clarify:

  1. writing case class A(a, b, c) and reading case class A(b, c) does not work (there is a check at TypedDataset.createUnsafe on number of top level attributes)
  2. however this useful feature seems to work:

    case class DetailBComplete(b: String, c: String)
    case class DetailCComplete(c: String, d: String)
    case class PaxComplete(a: String, b: List[DetailBComplete], c: DetailCComplete)
    
    case class DetailBReduced(b: String /*, c: String*/)
    case class PaxWithBReduced(a: String, b: List[DetailBReduced], c: DetailCComplete)
    
    it should "write PaxComplete record and read using PaxWithBReduced class (with less attributes)" in tmpDir {  (spark, path) =>
     val p = PaxComplete("name", List(DetailBComplete("bbb", "ccc")), DetailCComplete("ccccc", "ddddd"))
     val pexpected = PaxWithBReduced(p.a, p.b.map(i => DetailBReduced(i.b)), p.c)
     TypedDataset.create[PaxComplete](spark.sparkContext.parallelize[PaxComplete](Seq(p))).write.parquet(path)
     TypedDataset.createUnsafe[PaxWithBReduced](spark.read.parquet(path)).rdd.collect should equal(Array(pexpected))
    }

    Is this intended? Or just a non-addressed use case that works by chance?

marios iliofotou
@imarios
Let me look at this and get back to you
Mauricio Jost
@mauriciojost
Thanks @imarios. If it helps, our datasets always have the same schema / BOM version (even patch version, i.e. we don't mix versions; we use the same version throughout the whole lifecycle of the parquet path where the dataset is written)
Alberto Di Savia Puglisi
@disalberto
In other words, what you're asking, @mauriciojost, is: can we safely assume that we can read with a case class that has fewer (or the same number of) attributes than the case class used to write the parquet, at any level within the case class tree, provided the existing attributes keep exactly the same types? The only exception would be the root level, because of that check requiring the number of top-level attributes to match exactly.
Alex Astakhov
@al3xastakhov
Hi folks,
I'm new to frameless. Could you please suggest whether there is any way to perform the following select after a join without assigning the join result to a variable?
df1
  .joinLeft(df2)(df1('_1) === df2('_1))
  .select($"_1._1", $"_1._2", $"_2._2") // vanilla syntax
  .as[MyType]
  .map(...)
marios iliofotou
@imarios
@al3xastakhov my first impression is no, I think you will have to assign it to a variable for now. For future additions, I will check the syntax and see if there is an easier way to go about this.
marios iliofotou
@imarios
@al3xastakhov just confirming that this is the only way to do this right now in frameless.
Alex Astakhov
@al3xastakhov
@imarios okay, another follow-up then: even if I extract the join result to a variable, how do I query _1._1? It looks like $"".typedColumn[...] / a cast to a vanilla Spark Dataset is inevitable?
marios iliofotou
@imarios
Use colMany to access the nested schema
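For context, a minimal sketch of how colMany reaches a nested field (the case classes here are hypothetical; colMany takes the path of field names down the nested schema):

```scala
import frameless.TypedDataset

case class Inner(x: Int, y: String)
case class Outer(a: Inner, b: Long)

// Select the nested fields a.x and a.y out of a TypedDataset[Outer].
def selectNested(ds: TypedDataset[Outer]) =
  ds.select(ds.colMany('a, 'x), ds.colMany('a, 'y))
```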
Alex Astakhov
@al3xastakhov
cool, thanks!
marios iliofotou
@imarios
You're welcome
Ayoub
@I_am_ayoub_twitter
Hi, there are a couple of fixes on master not released yet. Is a new release planned?
marios iliofotou
@imarios
@I_am_ayoub_twitter I think we can cut a new release. We need to update the docs as well. I will look into that later today.
Ayoub
@I_am_ayoub_twitter
thanks !
marios iliofotou
@imarios
@I_am_ayoub_twitter I just updated the docs. Will start the process for a new build
Alex Astakhov
@al3xastakhov
Hi folks,
How do I write correct Injection for this:
ds.as[ClassWithEvent] // Unable to find encoder for type ClassWithEvent

case class ClassWithEvent(events: List[Event])

sealed trait Event extends Product with Serializable
case class EventA(a: Int, b: Int) extends Event
case class EventB(a: List[String]) extends Event
case class EventC(a: List[Other]) extends Event

sealed trait Other extends Product with Serializable
case object OtherState1 extends Other
case object OtherState2 extends Other
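For what it's worth, the enum-like Other part can get a plain string Injection; the Event hierarchy itself is harder because the three cases have different shapes, so a common (if blunt) workaround is to inject the whole Event into some serialized form such as a JSON string. A sketch of the easy half only, assuming the usual frameless Injection constructor (tag names are illustrative):

```scala
import frameless.Injection

sealed trait Other extends Product with Serializable
case object OtherState1 extends Other
case object OtherState2 extends Other

object Other {
  // Map each case object to a stable string tag and back.
  implicit val injection: Injection[Other, String] = Injection(
    {
      case OtherState1 => "OtherState1"
      case OtherState2 => "OtherState2"
    },
    {
      case "OtherState1" => OtherState1
      case "OtherState2" => OtherState2
    }
  )
}
```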
D Cameron Mauch
@DCameronMauch
Here is a fork/commit someone came up with to allow frameless to work with Databricks: dsabanin/frameless@1ce8f0b
I confirmed that this commit, applied to the 0.9.0 tag, worked with the latest Databricks cluster version: Spark 3.0.1 / Scala 2.12
Ayoub
@I_am_ayoub_twitter
Thanks for sharing
Daniel Schoepe
@dschoepe
Is there a way to select on the result of a joinInner without introducing a new variable for it? E.g. val x = dataset.joinInner(y)(???); x.select(x('_2)) works, but I was wondering if there's a way around having to introduce the name x. Alternatively, is there a version of join that returns only the matching entries of the dataset passed as the argument, i.e. y in the above example?
marios iliofotou
@imarios
@dschoepe you can use a join followed by a projection (select). I don't think there is a less verbose way of doing this.
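A sketch of that join-then-project pattern, with illustrative case classes (the join result is bound to a value once so its columns can be referenced; the joined dataset is a tuple, so colMany projects out of it):

```scala
import frameless.TypedDataset

case class Foo(id: Int, name: String)
case class Bar(id: Int, score: Double)

// joinInner yields a TypedDataset[(Foo, Bar)]; project the tuple fields with colMany.
def joinAndProject(a: TypedDataset[Foo], b: TypedDataset[Bar]) = {
  val joined = a.joinInner(b)(a('id) === b('id))
  joined.select(joined.colMany('_1, 'name), joined.colMany('_2, 'score))
}
```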
avandel
@avandel

Hello
I wrote this Injection for my enum

sealed trait State extends EnumEntry

object State extends Enum[State] {
  val values = findValues

  case object Actif extends State
  case object Inactif extends State

  implicit val _injection: Injection[State, String] = Injection(
    {
      case Actif   => Actif.entryName
      case Inactif => Inactif.entryName
    },
    {
      case "Actif"   => Actif
      case "Inactif" => Inactif
    }
  )

  implicit val _encoder: TypedEncoder[State] =
    frameless.TypedEncoder.usingInjection[State, String]
}

But when I use it in a select or withColumn like this

 ds.select(
        lit(State.Actif),
...

I get this error: could not find implicit value for evidence parameter of type frameless.TypedEncoder[State.Actif.type]
What did I miss?
Thank you for your help
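(For context, a likely culprit suggested by the error message: lit(State.Actif) infers the singleton type State.Actif.type, while the implicit TypedEncoder in scope is only for State. If so, widening the literal with a type ascription should help; this is a guess, not a confirmed fix:)

```scala
// Ascribe the case object to its parent trait so the implicit search
// looks for TypedEncoder[State] instead of TypedEncoder[State.Actif.type].
ds.select(
  lit(State.Actif: State)
  // ...
)
```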

Cédric Chantepie
@cchantep
Hi, feedback/hint to use ValueClass with TypedEncoder would be welcome: typelevel/frameless#516
Cédric Chantepie
@cchantep
^ Not sure whether that's an encoder issue or whether it only appears when combined with UDFs
Patrick GRANDJEAN
@pgrandjean

Hi, is anyone else having problems with frameless on Databricks?

Exception in thread "main" java.lang.VerifyError: class org.apache.spark.sql.FramelessInternals$DisambiguateRight overrides final method genCode.(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodegenContext;)Lorg/apache/spark/sql/catalyst/expressions/codegen/ExprCode;

I'm using Spark 2.4 + Scala 2.11 with Databricks runtime apache-spark-2.4.x-scala2.11.

marios iliofotou
@imarios
@pgrandjean can you check if this fix helps? dsabanin/frameless@1ce8f0b
The plan is to see if we can merge this into the next version. In the meantime, you can build and release your own version with the fix (if that's an option for you)
Patrick GRANDJEAN
@pgrandjean
@imarios ok, thanks!
Patrick GRANDJEAN
@pgrandjean
Hi @imarios, when running tests with the modified code, I am getting "not implemented" errors
related to protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = ??? in dataset/src/main/scala/frameless/functions/Lit.scala
marios iliofotou
@imarios
@pgrandjean let's open an issue ticket if you have the time. That would be better so we don't lose track of this. I can try to take a closer look as soon as I can
Patrick GRANDJEAN
@pgrandjean
ok
marios iliofotou
@imarios
ty
Patrick GRANDJEAN
@pgrandjean
done