    Jelmer Kuperus
    it makes no sense to me that it does this automatically based on some heuristic... it means every different dataset could end up with a different type... that's awful
    Jelmer Kuperus
    It seems there is an ascharacter method in Python but I cannot find its Scala equivalent
    Jelmer Kuperus

    I am lost. I tried changing the frame with


    h2oFrame((name: String, vec: Vec) => vec.toStringVec, Array("item"))
    val vec = h2oFrame.vec("item")
    h2oFrame.replace(h2oFrame.find("item"), vec.toStringVec)
    which seems to work until I pass it to word2vec, and then it crashes on a NullPointerException
    because h2oFrame._key.get.vec(0) is now null
    Jelmer Kuperus
    ugh if i do DKV.put(h2oFrame) it works...
    Jelmer Kuperus
    Latest attempt
    h2oContext.asH2OFrame(sequences, "sequences")((name: String, vec: Vec) => vec.toStringVec(), Array("item")).update()
    Jelmer Kuperus
    this thing is a nullpointer factory.. if i call word2Vec.toFrame and try to convert that back to a spark df it crashes because this frame has a null key, how do you fix that
    Michal Kurka

    basically anytime you make a change on an H2O object (e.g. a Frame) you need to reflect the updates to the distributed memory, otherwise all kinds of weird things can happen

    one way is to do DKV.put(frame) or follow the write-lock / update pattern, see e.g.:

    public Frame toCategoricalCol(int columIdx) {
      write_lock();
      replace(columIdx, vec(columIdx).toCategoricalVec()).remove();
      // Update frame in DKV
      update();
      unlock();
      return this;
    }
    the low level API can be intimidating, that is part of the reason why we have a more user friendly API in Sparkling Water
    Seems like we are missing a feature that would let you specify target column type when you convert Spark frame to H2O. Please feel free to request this improvement.
    Jelmer Kuperus
    @michalkurka is calling .update() on that frame a valid way too? for me that seemed to work and looked less ugly than updating the DKV itself
    Michal Kurka
    update() is valid; however, keep in mind that the updated object needs to be write-locked
    Jelmer Kuperus
    Being new to h2o, i have no idea what that means :-)
    Michal Kurka
    see the example above showing the implementation of toCategoricalCol
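A minimal sketch of that write-lock / update pattern applied to the earlier "item" column change, assuming the Frame methods discussed above (write_lock, update, unlock) and a frame named h2oFrame; it needs a running H2O cluster, so treat it as illustration only, not the definitive API:

```scala
import water.fvec.Frame

// Sketch only (assumes a live H2O cluster and a frame with an "item" column).
def itemToString(h2oFrame: Frame): Frame = {
  h2oFrame.write_lock()                           // take the write lock before mutating
  h2oFrame
    .replace(h2oFrame.find("item"),
             h2oFrame.vec("item").toStringVec())  // swap in the string version
    .remove()                                     // drop the replaced vec
  h2oFrame.update()                               // publish the change to the DKV
  h2oFrame.unlock()                               // release the write lock
  h2oFrame
}
```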
    Jelmer Kuperus
    created https://0xdata.atlassian.net/browse/SW-2441 as an improvement request
    @michalkurka do you reckon h2oContext.asDataFrame crashing on a frame with a null key is a bug? or is that just me getting lost in low-level API land
    Michal Kurka
    I don’t recall such a bug - but I know the low level api sometimes leads to NPEs like that
    Jelmer Kuperus
    Yeah it seems really brittle and poorly designed. But there does not seem to be a high-level Scala or Java API
    Despite that being its underpinnings
    Marek Novotný
    @jelmerk What's your SW version? If you want to use the internal Java API on Sparkling Water 3.30+, set the spark.ext.h2o.rest.api.based.client property to false. By default, SW in newer versions runs a thin client on the Spark driver which doesn't have direct access to the H2O DKV. So if you create an H2O frame on the Spark driver with the internal Java API, it won't get to the cluster with default settings and you will experience 'a frame with a null key' errors as you mentioned.
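For reference, a sketch of where that property goes, assuming it is set on the Spark session before the H2OContext is created; the app name is illustrative and only the property name comes from the message above:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: make SW 3.30+ use the internal backend client instead of the
// default REST-based thin client, so the driver can reach the H2O DKV.
val spark = SparkSession.builder()
  .appName("sw-internal-client")                            // illustrative name
  .config("spark.ext.h2o.rest.api.based.client", "false")   // from the tip above
  .getOrCreate()
```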
    Jelmer Kuperus
    @mn-mikke in this case i called toFrame on a Word2VecModel and it produces a frame with a null key (by design i think)
    i simply wanted to convert this dataframe to a spark dataframe and it gave an error
    if i use h2oContext.asDataFrame(new Frame(w2v.toFrame)) it works
    Marek Novotný
    @jelmerk Can you share your code with a Word2VecModel?
    Jelmer Kuperus
    basically i just want a list of words + word embeddings as a spark dataframe
    if i look at water.Keyed it says "Key mapping a Value which holds this object; may be null"
    yet asDataFrame crashes when it is
    Michal Kurka

    Word2VecModel#toFrame does return a Frame with no key - that is intentional. H2O functions rarely output keyed Frames because that would restrict the way you can work with the Frame. Not all Frames need to be in DKV; this is very heavily utilized during model building when frames are adapted but the changes do not need to surface to the client/user.

    In your case you need to assign a new key and issue a DKV put.
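A minimal sketch of that suggestion, assuming the w2v model from the chat; the key name is illustrative and the Frame(Key, names, vecs) constructor usage is an assumption:

```scala
import water.{DKV, Key}
import water.fvec.Frame

// Sketch only: give the keyless frame a key, then publish it cluster-wide.
val raw: Frame = w2v.toFrame                      // Frame with a null key
val keyed = new Frame(Key.make("w2v_embeddings"), // illustrative key name
                      raw.names(), raw.vecs())
DKV.put(keyed)                                    // issue the DKV put
```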

    Jelmer Kuperus
    Is that more efficient than creating a new frame like i do above
    Michal Kurka
    not really more efficient, just different - the constructor you are using has a slightly different purpose but it's fine for your use case
    also let me correct myself; you don’t need to install it in DKV either way if you are just going to do asDataFrame
    what you are doing is fine
    Sergio Calderón Pérez-Lozao

    hi! I'm trying to use pysparkling from a Jupyter notebook within an EMR cluster (AWS) and I'm getting this error:

    Py4JJavaError: An error occurred while calling o124.asH2OFrameKeyString.
    : ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: External H2O node responded with
    Status code: 500 : Server Error

    pyspark version: 2.3.2
    pysparkling version: ''
    Jelmer Kuperus
    What does sparkling water use driver memory for? it seems to need a lot and i'd think most of the work would be on the executors
    Cheng WeI
    Hi all, I am running SW on Databricks and there is some problem.
    when I run hex_dt = hc.asH2OFrame(dt_model), it returns this error:
    Py4JJavaError: An error occurred while calling o520.asH2OFrame.
    : ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node responded with
    Status code: 500 : Server Error
    does anyone know what is causing the error? dt_model is a spark dataframe
    Juan Campos
    @mn-mikke I'm trying to compile sparkling-water from the GitHub master branch, because I'd like to use the fixes for SW-2470 and SW-2476 before you build a new release. Currently I get errors when I compile. Do you know if there is a branch that I could use? Thanks in advance

    Hello, I'm trying to train an XGBoost model on EMR 6.1.0 but getting an error: "POST /3/ModelBuilders/xgboost not found"

    I use the following params:

    spark-shell --master yarn --driver-memory 30G --driver-cores 30 --executor-cores 5 --executor-memory 8G --num-executors 60  --conf spark.yarn.am.memoryOverhead=36G --conf spark.executor.memoryOverhead=10G --conf spark.sql.shuffle.partitions=1200  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf spark.dynamicAllocation.enabled=false --conf spark.locality.wait=0 --conf spark.scheduler.minRegisteredResourcesRatio=1  --conf spark.ext.h2o.allow_insecure_xgboost=true --packages ai.h2o:sparkling-water-package_2.12:
    val aggRes = spark.read.parquet("s3://blah/blah")
    import ai.h2o.sparkling.H2OContext
    import ai.h2o.sparkling.ml.algos.H2OXGBoost
    val h2o = H2OContext.getOrCreate()
    case class ModelParams() {
        val sampleRate = Some(0.6)
        val subsample = Some(0.6)
        val nTrees = Some(140)
        val maxDepth = Some(10)
        val nFolds = Some(5)
        val eta = Some(0.05)
        val learnRate = Some(0.05)
        val gamma = Some(0F)
        val maxDeltaStep = Some(0F)
        val colsampleRate = Some(0.8)
        val colsampleByLevel = Some(0.8)
        val colsampleRatePerTree = Some(0.8)
        val colsampleBytree = Some(0.8)
        val regLambda = Some(1.0F)
        val minChildWeight = Some(5)
        val minRows = Some(5)
        val regAlpha: Option[Float] = None
    }
    val modelParams = ModelParams()
    val xgb = new H2OXGBoost()
        .setBackend("cpu") // Need to specify cpu, otherwise will default to CUDA
    val model = xgb.fit(aggRes)

    Has anyone faced this error?

    Giordano Alvari
    Hello, I am trying to run pysparkling directly from a notebook. I have executed all the steps highlighted in the installation but I keep getting this error: cannot import name 'Context' from 'pysparkling.context' (path_env/lib/python3.7/site-packages/pysparkling/context/init.py). Do you have any suggestions? thank you in advance :)

    Hi there :wave: I'm trying to upgrade to Sparkling Water 3.32.1.7-1-3.1 but I'm facing an error when I try to launch an H2O cluster locally for tests. My Scala version is 2.12.14. When I run


    I get

    An exception or error caused a run to abort: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting; 
    java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
        at ai.h2o.sparkling.repl.H2OInterpreter.createSettings(H2OInterpreter.scala:55)
        at ai.h2o.sparkling.repl.BaseH2OInterpreter.initializeInterpreter(BaseH2OInterpreter.scala:113)
        at ai.h2o.sparkling.repl.BaseH2OInterpreter.<init>(BaseH2OInterpreter.scala:265)
        at ai.h2o.sparkling.repl.H2OInterpreter.<init>(H2OInterpreter.scala:41)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.ai$h2o$sparkling$backend$api$scalainterpreter$ScalaInterpreterServlet$$createInterpreterInPool(ScalaInterpreterServlet.scala:101)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.$anonfun$initializeInterpreterPool$1(ScalaInterpreterServlet.scala:95)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.$anonfun$initializeInterpreterPool$1$adapted(ScalaInterpreterServlet.scala:94)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.initializeInterpreterPool(ScalaInterpreterServlet.scala:94)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.<init>(ScalaInterpreterServlet.scala:48)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet$.getServlet(ScalaInterpreterServlet.scala:145)
        at ai.h2o.sparkling.backend.api.ServletRegister.register(ServletRegister.scala:30)
        at ai.h2o.sparkling.backend.api.ServletRegister.register$(ServletRegister.scala:29)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet$.register(ScalaInterpreterServlet.scala:142)
        at water.webserver.jetty9.SparklingWaterJettyHelper.createServletContextHandler(SparklingWaterJettyHelper.scala:43)
        at water.webserver.jetty9.SparklingWaterJettyHelper.startServer(SparklingWaterJettyHelper.scala:95)
        at ai.h2o.sparkling.backend.utils.ProxyStarter$.startFlowProxy(ProxyStarter.scala:42)
        at ai.h2o.sparkling.H2OContext.<init>(H2OContext.scala:103)
        at ai.h2o.sparkling.H2OContext$.getOrCreate(H2OContext.scala:470)

    I verified that Settings.usejavacp() exists in the Scala SDK so I'm not sure why it's missing during run time.
    Tips for a fix would be greatly appreciated!

    Carlos Laviola

    Hi all, I'm trying to upgrade Sparkling Water in our Hadoop cluster to the latest version and running into some issues. One of our use cases is running it in an Oozie workflow, through a Spark action. The Oozie Spark action is essentially like submitting a job using spark-submit, but we're getting this error:

    Job aborted due to stage failure: Task 0 in stage 0.0 failed 10 times, most recent failure: Lost task 0.9 in stage 0.0 (TID 9, hostname.example.com, executor 5): java.lang.IllegalStateException: unread block data

    and we're a bit stumped, as it seems like everything is correct.
    I can show pastes of the options passed to spark and of the python script itself.
