    Oscar Pan
    @OscarDPan
    Hi,
I was comparing H2O and Spark for loading a libsvm file and training a model. The reason I did this is that h2o.import_file() doesn't let me declare numFeatures the way Spark does, so the model just assumes num_columns = max_index in the libsvm file. This is a bit annoying, as it makes len(actual_column_names) > len(model's columns). However, if I load the svmlight file with Spark using the numFeatures argument and then pass it to sparkling.ml, the resulting model still ignores numFeatures. Is there any way I can explicitly tell H2O that my dataset has N columns, so that I get a consistent number of columns across different libsvm files and models? Thanks!
    1 reply
    Oscar Pan
    @OscarDPan
I understand that, technically, if a feature column doesn't exist in a training set, we should ignore that column at inference time, since we've learned nothing from it. But what would be the best practice for keeping a consistent number of columns, so I don't have to check whether len(actual_column_names) == len(model's columns)?
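The width mismatch described above comes from libsvm being a sparse format: a reader with no numFeatures option can only infer the frame width from the largest feature index it actually sees. A common workaround (sketched below in plain Python, no H2O or Spark required; the helper names and the target width are made up for illustration) is to append an explicit zero-valued entry for column N to one row, which forces any max-index-based reader to N columns:

```python
# Minimal sketch of the max-index problem and a padding workaround.
# libsvm rows look like "label idx:val idx:val ..." with 1-based indices.

def max_feature_index(libsvm_lines):
    """Largest 1-based feature index present in the data."""
    top = 0
    for line in libsvm_lines:
        for token in line.split()[1:]:          # skip the label
            idx = int(token.split(":")[0])
            top = max(top, idx)
    return top

def pad_to_n_features(libsvm_lines, n):
    """Append an explicit '<n>:0' entry to the first row if no row
    reaches index n, so a max-index-based reader infers n columns."""
    lines = list(libsvm_lines)
    if lines and max_feature_index(lines) < n:
        lines[0] = f"{lines[0]} {n}:0"
    return lines

rows = ["1 1:0.5 3:1.2", "0 2:0.7"]
padded = pad_to_n_features(rows, 10)   # now infers 10 columns, not 3
```

This keeps the frame width stable across different libsvm files without changing any nonzero values, at the cost of one stored zero per file.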
    ernestklchan
    @ernestklchan
Hello again, do you know when the next version of Sparkling Water will be released? I'm facing issues during model scoring because Sparkling Water computes Tree SHAP contributions even though I set withContributions to false. I think this PR @jakubhava worked on fixes the issue: h2oai/sparkling-water#2218. I'm hoping that fix will be in the next release.
    13 replies
    foxale
    @foxale
Is there any known reason why H2OContext.getOrCreate() would sometimes freeze forever right after creating the Spark session? I've tried different configs, but usually it's something like:
    .config('spark.dynamicAllocation.enabled', 'false')
    .config('spark.executor.instances', '1')
    .config('spark.executor.cores', '5')
    .config('spark.sql.shuffle.partitions', '50')
    .config('spark.port.maxRetries', '128') # 32
    .config('spark.executor.memory', '16G')
    .config('spark.driver.memory', '16G')
    )
    2 replies
    Oscar Pan
    @OscarDPan
    Hi,
I was wondering whether this issue is really resolved: https://0xdata.atlassian.net/browse/SW-602
I was trying to convert a Spark DataFrame of [label, SparseVector(about 20k columns)] to an H2O frame, but it took forever, unless I saved it to svmlight and loaded it back.
    Thanks!
    16 replies
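The svmlight round trip mentioned above relies on rendering each sparse row as a text line. A minimal sketch in plain Python (a stand-in for the Spark side; the row layout below is hypothetical) of what that serialization looks like:

```python
# Sketch of the svmlight-file workaround: write each (label, sparse pairs)
# row as one svmlight text line, then import that file instead of
# converting the wide SparseVector DataFrame directly.

def to_svmlight_line(label, pairs):
    """Render one (label, [(index, value), ...]) row as an svmlight line.
    Indices are 1-based and must be emitted in ascending order."""
    feats = " ".join(f"{i}:{v:g}" for i, v in sorted(pairs))
    return f"{label} {feats}".rstrip()

rows = [(1, [(3, 1.25), (1, 0.5)]), (0, [(2, 0.75)])]
lines = [to_svmlight_line(lbl, pairs) for lbl, pairs in rows]
# lines -> ["1 1:0.5 3:1.25", "0 2:0.75"]
```

Because only the nonzero entries are written, this scales with the number of stored values rather than the ~20k-column width, which is why the detour through svmlight can beat a direct dense conversion.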
    Oscar Pan
    @OscarDPan
    This message was deleted
    RajeshBirada
    @RajeshBirada
Hi @jakubhava,
I have a couple of questions about using the H2OAutoML algorithm from PySparkling (version 3.30.1.1-1-2.2):
Is there a way to load a model by the model id shown in the leaderboard? We had this feature in h2o.automl.
Also, how can I save an H2OAutoML model and load it back? When I tried, I got an exception.
When I instead added it as a stage of a pipeline, I was able to reload the model, but loading it back as an H2OAutoML object was failing.
    8 replies
    RajeshBirada
    @RajeshBirada
Adding to the above: I was looking to get the confusion matrix, F1 score, and other classification metrics. I don't see those in H2OAutoML (Sparkling Water).
    2 replies
    Jelmer Kuperus
    @jelmerk

The schema of my Spark DataFrame is

     |-- item: string (nullable = true)

I convert it to H2O with

    h2oContext.asH2OFrame(sequences, "sequences")

But then somehow the type of the column becomes Enum.

How can I avoid that?
    Jelmer Kuperus
    @jelmerk
Looking at the code, it's the guessTypes method of PreviewParseWriter?
It makes no sense to me that it does this automatically based on some heuristic... it means every different dataset could end up with a different type. That's awful.
    Jelmer Kuperus
    @jelmerk
It seems there is an ascharacter method in Python, but I cannot find its Scala equivalent.
    Jelmer Kuperus
    @jelmerk

I am lost. I tried changing the frame with

h2oFrame((name: String, vec: Vec) => vec.toStringVec, Array("item"))

and

val vec = h2oFrame.vec("item")
h2oFrame.replace(h2oFrame.find("item"), vec.toStringVec)
vec.remove

which seems to work until I pass it to word2vec, and then it crashes on a NullPointerException because h2oFrame._key.get.vec(0) is now null.
    Jelmer Kuperus
    @jelmerk
Ugh, if I do DKV.put(h2oFrame) it works...
    Jelmer Kuperus
    @jelmerk
    Latest attempt
    h2oContext.asH2OFrame(sequences, "sequences")((name: String, vec: Vec) => vec.toStringVec(), Array("item")).update()
    Jelmer Kuperus
    @jelmerk
This thing is a NullPointerException factory... if I call word2Vec.toFrame and try to convert that back to a Spark DataFrame, it crashes because the frame has a null key. How do you fix that?
    Michal Kurka
    @michalkurka

basically, any time you make a change to an H2O object (e.g. a Frame) you need to reflect the update in the distributed memory, otherwise all kinds of weird things can happen

one way is to do DKV.put(frame), or follow the write-lock / update pattern, e.g.:

public Frame toCategoricalCol(int columnIdx) {
    write_lock();
    replace(columnIdx, vec(columnIdx).toCategoricalVec()).remove();
    // update the frame in DKV
    update();
    unlock();
    return this;
}

the low-level API can be intimidating; that is part of the reason why we have a more user-friendly API in Sparkling Water
It seems we are missing a feature that would let you specify the target column type when converting a Spark frame to H2O. Please feel free to request this improvement.
    Jelmer Kuperus
    @jelmerk
@michalkurka is calling .update() on that frame a valid way too? For me that seemed to work, and it looked less ugly than updating the DKV myself.
    Michal Kurka
    @michalkurka
update() is valid; however, keep in mind that the updated object needs to be write-locked
    Jelmer Kuperus
    @jelmerk
Being new to H2O, I have no idea what that means :-)
    Michal Kurka
    @michalkurka
    see the example above showing the implementation of toCategoricalCol
    Jelmer Kuperus
    @jelmerk
created https://0xdata.atlassian.net/browse/SW-2441 as an improvement request
@michalkurka do you reckon h2oContext.asDataFrame crashing on a frame with a null key is a bug? Or is that just me getting lost in low-level API land?
    Michal Kurka
    @michalkurka
    I don’t recall such a bug - but I know the low level api sometimes leads to NPEs like that
    Jelmer Kuperus
    @jelmerk
Yeah, it seems really brittle and poorly designed. But there does not seem to be a high-level Scala or Java API, despite that being its underpinnings.
    Marek Novotný
    @mn-mikke
@jelmerk What's your SW version? If you want to use the internal Java API on Sparkling Water 3.30+, set the spark.ext.h2o.rest.api.based.client property to false. By default, newer versions of SW run a thin client on the Spark driver which doesn't have direct access to the H2O DKV. So if you create an H2O frame on the Spark driver with the internal Java API, it won't get to the cluster under the default settings, and you will see "a frame with a null key" errors like the one you mentioned.
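For PySparkling users, the property named above is set when the Spark session is built. A configuration sketch (the property name comes from this thread; the app name is hypothetical, and pyspark plus pysparkling are assumed to be installed):

```python
# Config sketch: disable the REST-based thin client so driver-side code
# using the internal Java API has direct access to the H2O DKV.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sw-internal-client")  # hypothetical name
    .config("spark.ext.h2o.rest.api.based.client", "false")
    .getOrCreate()
)
```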
    Jelmer Kuperus
    @jelmerk
@mn-mikke in this case I called toFrame on a Word2VecModel, and it produces a frame with a null key (by design, I think)
I simply wanted to convert this frame to a Spark DataFrame, and it gave an error
if I use h2oContext.asDataFrame(new Frame(w2v.toFrame)) it works
    Marek Novotný
    @mn-mikke
@jelmerk Can you share your code with a Word2VecModel?
Jelmer Kuperus
@jelmerk
Basically I just want a list of words + word embeddings as a Spark DataFrame.
If I look at water.Keyed, it says "Key mapping a Value which holds this object; may be null"
    Jelmer Kuperus
    @jelmerk
    yet asDataFrame crashes when it is
    Michal Kurka
    @michalkurka

Word2VecModel#toFrame does return a Frame with no key; that is intentional. H2O functions rarely output keyed Frames, because that would restrict the ways you can work with the Frame. Not all Frames need to be in the DKV; this is used very heavily during model building, when frames are adapted but the changes do not need to surface to the client/user.

In your case, you need to assign a new key and issue a DKV put.

    Jelmer Kuperus
    @jelmerk
Is that more efficient than creating a new frame like I do above?
    Michal Kurka
    @michalkurka
not really more efficient, just different; the constructor you are using has a slightly different purpose, but it's fine for your use case
also, let me correct myself: you don't need to install it in the DKV either way if you are just going to do asDataFrame
what you are doing is fine
    Sergio Calderón Pérez-Lozao
    @sergiocalde94

hi! I'm trying to use PySparkling from a Jupyter notebook on an EMR cluster (AWS) and I'm getting this error:

    Py4JJavaError: An error occurred while calling o124.asH2OFrameKeyString.
    : ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: External H2O node 100.84.158.78:54321 responded with
    Status code: 500 : Server Error

    pyspark version: 2.3.2
    pysparkling version: '3.30.0.2-1-2.3'
    36 replies
    Jelmer Kuperus
    @jelmerk
What does Sparkling Water use driver memory for? It seems to need a lot, and I'd think most of the work would be on the executors.
    13 replies