    Jelmer Kuperus
    @jelmerk
    it makes no sense that it does this automatically, based on some heuristic... to me it means every different dataset could end up with a different type.. that's awful
    Jelmer Kuperus
    @jelmerk
    It seems there is an ascharacter method in Python but I cannot find its Scala equivalent
    Jelmer Kuperus
    @jelmerk

    I am lost. I tried changing the frame with

    h2oFrame((name: String, vec: Vec) => vec.toStringVec, Array("item"))

    and

    val vec = h2oFrame.vec("item")
    h2oFrame.replace(h2oFrame.find("item"), vec.toStringVec)
    vec.remove
    which seems to work until I pass it to word2vec, and then it crashes on a NullPointerException because h2oFrame._key.get.vec(0) is now null
    Jelmer Kuperus
    @jelmerk
    ugh if i do DKV.put(h2oFrame) it works...
    Jelmer Kuperus
    @jelmerk
    Latest attempt
    h2oContext.asH2OFrame(sequences, "sequences")((name: String, vec: Vec) => vec.toStringVec(), Array("item")).update()
    Jelmer Kuperus
    @jelmerk
    this thing is a nullpointer factory.. if I call word2Vec.toFrame and try to convert that back to a Spark df it crashes because this frame has a null key. How do you fix that?
    Michal Kurka
    @michalkurka

    basically anytime you make a change to an H2O object (e.g. a Frame) you need to reflect the update in the distributed memory, otherwise all kinds of weird things can happen

    one way is to do DKV.put(frame) or follow the write-lock / update pattern, see e.g.:

    public Frame toCategoricalCol(int columnIdx) {
        write_lock();
        replace(columnIdx, vec(columnIdx).toCategoricalVec()).remove();
        // Update frame in DKV
        update();
        unlock();
        return this;
    }

    the low level API can be intimidating, that is part of the reason why we have a more user-friendly API in Sparkling Water
    Seems like we are missing a feature that would let you specify the target column type when you convert a Spark frame to H2O. Please feel free to request this improvement.
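    The pattern quoted above can be sketched for the "item" string-conversion case from earlier in the thread. This is illustrative only: it assumes an H2O classpath and a running cluster, mirrors the no-argument write_lock()/update()/unlock() calls from the example, and the helper name toStringCol is made up here:

    ```scala
    import water.fvec.Frame

    // Sketch: convert a column to a string column using the
    // write-lock / update pattern shown above.
    def toStringCol(fr: Frame, colName: String): Frame = {
      fr.write_lock()                                      // lock the frame before mutating it
      val idx = fr.find(colName)
      fr.replace(idx, fr.vec(idx).toStringVec()).remove()  // swap in the string vec, drop the old one
      fr.update()                                          // publish the change to the DKV
      fr.unlock()
      fr
    }
    ```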
    Jelmer Kuperus
    @jelmerk
    @michalkurka is calling .update() on that frame a valid way too? for me that seemed to work and looked less ugly than updating the DKV myself
    Michal Kurka
    @michalkurka
    update() is valid, however, keep in mind that the updated object needs to be write-locked
    Jelmer Kuperus
    @jelmerk
    Being new to h2o, i have no idea what that means :-)
    Michal Kurka
    @michalkurka
    see the example above showing the implementation of toCategoricalCol
    Jelmer Kuperus
    @jelmerk
    created https://0xdata.atlassian.net/browse/SW-2441 as an improvement request
    @michalkurka do you reckon h2oContext.asDataFrame crashing on a frame with a null key is a bug? or is that just me getting lost in low level api land
    Michal Kurka
    @michalkurka
    I don’t recall such a bug - but I know the low level api sometimes leads to NPEs like that
    Jelmer Kuperus
    @jelmerk
    Yeah it seems really brittle and poorly designed. But there does not seem to be a high-level Scala or Java API
    Despite that being its underpinnings
    Marek Novotný
    @mn-mikke
    @jelmerk What's your SW version? If you want to use the internal Java API on Sparkling Water 3.30+, set the spark.ext.h2o.rest.api.based.client property to false. By default, SW in newer versions runs a thin client on the Spark driver which doesn't have direct access to the H2O DKV. So if you create an H2O frame on the Spark driver with the internal Java API, it won't get to the cluster with default settings and you will see 'a frame with a null key' errors like the one you mentioned.
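    A minimal sketch of setting that property (the property name and value come from the message above; the rest is illustrative — the property has to land on the Spark config before H2OContext.getOrCreate() runs, and the app name is arbitrary):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Disable the REST-based client so the Spark driver talks to the DKV directly
    // (needs a Sparkling Water classpath).
    val spark = SparkSession.builder()
      .appName("sw-internal-client")
      .config("spark.ext.h2o.rest.api.based.client", "false")
      .getOrCreate()
    ```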
    Jelmer Kuperus
    @jelmerk
    @mn-mikke in this case i called toFrame on a Word2VecModel and it produces a frame with a null key (by design i think)
    i simply wanted to convert this dataframe to a spark dataframe and it gave an error
    if i use h2oContext.asDataFrame(new Frame(w2v.toFrame)) it works
    Marek Novotný
    @mn-mikke
    @jelmerk Can you share your code with a Word2VecModel?
    Jelmer Kuperus
    @jelmerk
    basically i just want a list of words + word embeddings as a spark dataframe
    if i look at water.Keyed it says "Key mapping a Value which holds this object; may be null"
    Jelmer Kuperus
    @jelmerk
    yet asDataFrame crashes when it is
    Michal Kurka
    @michalkurka

    Word2VecModel#toFrame does return a Frame with no key - that is intentional. H2O functions rarely output keyed Frames because that would restrict the way you can work with the Frame. Not all Frames need to be in DKV; this is very heavily utilized during model building, when frames are adapted but the changes do not need to surface to the client/user.

    In your case you need to assign a new key and issue a DKV put.
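    That can be sketched in Scala roughly as follows (assumes a running H2O cluster on the classpath; `w2v` stands for the Word2VecModel from the earlier messages and the key name "w2v_embeddings" is made up for illustration):

    ```scala
    import water.{DKV, Key}
    import water.fvec.Frame

    // Wrap the keyless Word2Vec output frame in a new keyed Frame
    // and register it in the distributed key/value store.
    val raw: Frame = w2v.toFrame()
    val keyed = new Frame(Key.make("w2v_embeddings"), raw.names(), raw.vecs())
    DKV.put(keyed)
    ```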

    Jelmer Kuperus
    @jelmerk
    Is that more efficient than creating a new frame like I do above?
    Michal Kurka
    @michalkurka
    not really more efficient, just different - the constructor you are using has a slightly different purpose but it's fine for your use case
    also let me correct myself; you don’t need to install it in DKV either way if you are just going to do asDataFrame
    what you are doing is fine
    Sergio Calderón Pérez-Lozao
    @sergiocalde94

    hi! I'm trying to use PySparkling from a Jupyter notebook within an EMR cluster (AWS) and I'm getting this error:

    Py4JJavaError: An error occurred while calling o124.asH2OFrameKeyString.
    : ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: External H2O node 100.84.158.78:54321 responded with
    Status code: 500 : Server Error

    pyspark version: 2.3.2
    pysparkling version: '3.30.0.2-1-2.3'
    36 replies
    Jelmer Kuperus
    @jelmerk
    What does sparkling water use driver memory for ? it seems to need a lot and i'd think most of the work would be on the executors
    13 replies
    Cheng WeI
    @valkyrias_gitlab
    Hi All, I am running SW on Databricks and there is a problem.
    When I run hex_dt = hc.asH2OFrame(dt_model), it returns this error:
    Py4JJavaError: An error occurred while calling o520.asH2OFrame.
    : ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node 10.139.64.4:54321 responded with
    Status code: 500 : Server Error
    does anyone know what is causing the error? dt_model is a Spark dataframe
    12 replies
    Cheng WeI
    @valkyrias_gitlab
    Juan Campos
    @jcamposz
    @mn-mikke I'm trying to compile sparkling-water from the GitHub master branch, because I'd like to use the fixes for SW-2470 and SW-2476 before you build a new release. Currently I get errors when I compile. Do you know if there is a branch that I could use? Thanks in advance
    4 replies
    Timir
    @TimirN

    Hello, I'm trying to train an XGBoost model on EMR 6.1.0 but I'm getting an error: "POST /3/ModelBuilders/xgboost not found"

    I use the following params:

    spark-shell --master yarn --driver-memory 30G --driver-cores 30 --executor-cores 5 --executor-memory 8G --num-executors 60  --conf spark.yarn.am.memoryOverhead=36G --conf spark.executor.memoryOverhead=10G --conf spark.sql.shuffle.partitions=1200  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer  --conf spark.dynamicAllocation.enabled=false --conf spark.locality.wait=0 --conf spark.scheduler.minRegisteredResourcesRatio=1  --conf spark.ext.h2o.allow_insecure_xgboost=true --packages ai.h2o:sparkling-water-package_2.12:3.32.0.2-1-3.0
    
    val aggRes = spark.read.parquet("s3://blah/blah")
    
    import ai.h2o.sparkling.H2OContext
    import ai.h2o.sparkling.ml.algos.H2OXGBoost
    val h2o = H2OContext.getOrCreate()
    
    case class ModelParams() {
        val sampleRate = Some(0.6)
        val subsample = Some(0.6)
        val nTrees = Some(140)
        val maxDepth = Some(10)
        val nFolds = Some(5)
        val eta = Some(0.05)
        val learnRate = Some(0.05)
        val gamma = Some(0F)
        val maxDeltaStep = Some(0F)
        val colsampleRate = Some(0.8)
        val colsampleByLevel = Some(0.8)
        val colsampleRatePerTree = Some(0.8)
        val colsampleBytree = Some(0.8)
        val regLambda = Some(1.0F)
        val minChildWeight = Some(5)
        val minRows = Some(5)
        val regAlpha: Option[Float] = None
    }
    
    
    val modelParams = ModelParams()
    val xgb = new H2OXGBoost()
        .setBackend("cpu") // Need to specify cpu, otherwise will default to CUDA
        .setLabelCol("ctr")
        .setBooster("gbtree")
        .setTreeMethod("auto")
        .setGrowPolicy("depthwise")
        .setNtrees(modelParams.nTrees.getOrElse(140))
        .setMaxDepth(modelParams.maxDepth.getOrElse(10))
        .setNfolds(modelParams.nFolds.getOrElse(5))
        .setSampleRate(modelParams.sampleRate.getOrElse(0.6))
        .setSubsample(modelParams.subsample.getOrElse(0.6))
        .setEta(modelParams.eta.getOrElse(0.05))
        .setLearnRate(modelParams.learnRate.getOrElse(0.05))
        .setGamma(modelParams.gamma.getOrElse(0.0F))
        .setMaxDeltaStep(modelParams.maxDeltaStep.getOrElse(0.0F))
        .setColSampleRate(modelParams.colsampleRate.getOrElse(0.8))
        .setColSampleByLevel(modelParams.colsampleByLevel.getOrElse(0.8))
        .setColSampleRatePerTree(modelParams.colsampleRatePerTree.getOrElse(0.8))
        .setColSampleByTree(modelParams.colsampleBytree.getOrElse(0.8))
        .setMinChildWeight(modelParams.minChildWeight.getOrElse(5).toDouble)
        .setMinRows(modelParams.minRows.getOrElse(5).toDouble)
        .setRegLambda(modelParams.regLambda.getOrElse(1.0F))
        .setRegAlpha(modelParams.regAlpha.getOrElse(0.0F))
    xgb.setSeed(-7381329239932670029L)
    xgb.setWeightCol("n_imp_pixels")
    
    val model = xgb.fit(aggRes)

    Has anyone faced this error?

    4 replies
    Giordano Alvari
    @Dhonveli
    Hello, I am trying to run pysparkling directly from a notebook. I have executed all the steps highlighted in the installation but I keep getting this error: cannot import name 'Context' from 'pysparkling.context' (path_env/lib/python3.7/site-packages/pysparkling/context/__init__.py). Do you have any suggestions? Thank you in advance :)
    ErnestChan
    @ErnestChan

    Hi there :wave: I'm trying to upgrade to Sparkling Water 3.32.1.7-1-3.1 but I'm facing an error when I try to launch an H2O cluster locally for tests. My Scala version is 2.12.14. When I run

    H2OContext.getOrCreate(H2OConf()
          .setInternalClusterMode())

    I get

    An exception or error caused a run to abort: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting; 
    java.lang.NoSuchMethodError: scala.tools.nsc.Settings.usejavacp()Lscala/tools/nsc/settings/AbsSettings$AbsSetting;
        at ai.h2o.sparkling.repl.H2OInterpreter.createSettings(H2OInterpreter.scala:55)
        at ai.h2o.sparkling.repl.BaseH2OInterpreter.initializeInterpreter(BaseH2OInterpreter.scala:113)
        at ai.h2o.sparkling.repl.BaseH2OInterpreter.<init>(BaseH2OInterpreter.scala:265)
        at ai.h2o.sparkling.repl.H2OInterpreter.<init>(H2OInterpreter.scala:41)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.ai$h2o$sparkling$backend$api$scalainterpreter$ScalaInterpreterServlet$$createInterpreterInPool(ScalaInterpreterServlet.scala:101)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.$anonfun$initializeInterpreterPool$1(ScalaInterpreterServlet.scala:95)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.$anonfun$initializeInterpreterPool$1$adapted(ScalaInterpreterServlet.scala:94)
        at scala.collection.immutable.Range.foreach(Range.scala:158)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.initializeInterpreterPool(ScalaInterpreterServlet.scala:94)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet.<init>(ScalaInterpreterServlet.scala:48)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet$.getServlet(ScalaInterpreterServlet.scala:145)
        at ai.h2o.sparkling.backend.api.ServletRegister.register(ServletRegister.scala:30)
        at ai.h2o.sparkling.backend.api.ServletRegister.register$(ServletRegister.scala:29)
        at ai.h2o.sparkling.backend.api.scalainterpreter.ScalaInterpreterServlet$.register(ScalaInterpreterServlet.scala:142)
        at water.webserver.jetty9.SparklingWaterJettyHelper.createServletContextHandler(SparklingWaterJettyHelper.scala:43)
        at water.webserver.jetty9.SparklingWaterJettyHelper.startServer(SparklingWaterJettyHelper.scala:95)
        at ai.h2o.sparkling.backend.utils.ProxyStarter$.startFlowProxy(ProxyStarter.scala:42)
        at ai.h2o.sparkling.H2OContext.<init>(H2OContext.scala:103)
        at ai.h2o.sparkling.H2OContext$.getOrCreate(H2OContext.scala:470)

    I verified that Settings.usejavacp() exists in the Scala SDK so I'm not sure why it's missing during run time.
    Tips for a fix would be greatly appreciated!

    2 replies
    Carlos Laviola
    @claviola

    Hi all, I'm trying to upgrade Sparkling Water in our Hadoop cluster to the latest version and running into some issues. One of our use cases is running it in an Oozie workflow, through a Spark action. The Oozie Spark action is essentially like submitting a job using spark-submit, but we're getting this error:

    Job aborted due to stage failure: Task 0 in stage 0.0 failed 10 times, most recent failure: Lost task 0.9 in stage 0.0 (TID 9, hostname.example.com, executor 5): java.lang.IllegalStateException: unread block data

    and we're a bit stumped, as it seems like everything is correct.
    I can show pastes of the options passed to spark and of the python script itself.

    2 replies