    yass2018
    @yass2018
    i don't know how to implement it in spark using libSVM and i don't think it is possible
    Vishal Bahirwani
    @vishalbahirwani
    I'm using Pysparkling-2.2 and training H2OGradientBoostingEstimator model. One of the parameters in search_criteria is 'max_runtime_secs': 1800 (to limit grid search to 30 mins). However, grid search / model training does not converge or stop within required time. H2O is running in internal backend mode on AWS EMR. GBM tuning steps are followed as shown in the example - https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb. Interestingly, if I reduce max_runtime_secs to 10-15 mins, training converges. Anything greater than this, and training keeps on going.
    Erin LeDell
    @ledell
    thanks for the report @vishalbahirwani that sounds like a bug
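For readers hitting the same problem: a time-boxed random grid search is usually configured through a `search_criteria` dictionary. The sketch below only builds such a dictionary. `build_search_criteria` is a hypothetical helper, not part of H2O, and the extra `max_models` and early-stopping caps are added on the assumption that a wall-clock budget alone may not end the search, as reported above.

```python
def build_search_criteria(max_runtime_secs, max_models=None, seed=1234):
    """Build a RandomDiscrete search_criteria dict with a wall-clock budget.

    Hypothetical helper for illustration; the keys follow the names used
    by H2O's grid search, the default values are arbitrary.
    """
    if max_runtime_secs <= 0:
        raise ValueError("max_runtime_secs must be positive")
    criteria = {
        "strategy": "RandomDiscrete",
        "max_runtime_secs": int(max_runtime_secs),
        "seed": seed,
        # Early stopping across the grid, as a second safety net.
        "stopping_rounds": 5,
        "stopping_metric": "AUTO",
        "stopping_tolerance": 1e-3,
    }
    if max_models is not None:
        # Hard cap on the number of models, independent of elapsed time.
        criteria["max_models"] = int(max_models)
    return criteria


criteria = build_search_criteria(max_runtime_secs=1800, max_models=50)
```

The resulting dictionary would then be passed as `search_criteria=criteria` when constructing the grid search. Whether this avoids the hang described above is untested.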
    Satyanarayan Bhanja
    @snbhanja
    Hello, I need some help. Has anyone done a Sparkling Water Kubernetes setup for training? I am not able to find any documents or blogs.
    foxale
    @foxale
    H2OAutoML is lacking the leaderboard and many more functionalities in the Python Sparkling Water API compared to FlowUI - will it be like that forever?
    Erin LeDell
    @ledell
    hi @foxale personally i have not tried using AutoML with PySparkling yet...maybe @jakubhava has. do you have a code example you can share? we intend to have full support in all our interfaces.
    foxale
    @foxale
    @ledell Sure, here you go: https://colab.research.google.com/gist/foxale/74208b8f3e5bc819db20889d6a973dd1/automl-in-pysparkling-vs-h2o.ipynb
    • all of the errors are on purpose, to show the inconsistency between h2o and pysparkling
    • pysparkling ML functions work on pyspark DataFrames, not on h2o_frames (???)
    • AutoML trains models and saves the leader as a single MOJO model; you can't access the leaderboard or use other functionalities implemented both in local h2o and FlowUI
    • the docs of the pysparkling ml functions are incomplete or non-existent
      Sparkling Water / pysparkling is under development of course, and the work you've already done is absolutely great; however, the amount of time necessary to understand what can be done in pysparkling water vs pyspark vs h2o vs the different sparkling water APIs is extraordinary.
    foxale
    @foxale
    Ok I just realized that if you use H2OContext.getOrCreate() and then use some model from h2o.estimators, it will be computed within the h2o cluster (in both external and internal backend) and it has full API support. Am I right?
    So the last question is: will the pysparkling API have the same API support as h2o in Python, and is that even possible?
    Erin LeDell
    @ledell
    @foxale so is your example working now? or do you still have issues using .train() and the other methods?
    foxale
    @foxale
    @ledell Yeah, based off of what I understand, the algos in pysparkling package are crippled so that they can be fully integrated into pyspark's pipelines. That's why they lack numerous arguments and attributes like .train() or .leaderboard(). Correct me if I'm wrong.
    Erin LeDell
    @ledell
    @foxale ok, well we would want to fix that if it's the case.... @jakubhava is back from vacation now so maybe he can take a look! :-)
    Jakub Háva
    @jakubhava

    @foxale

    If you need to use H2O Frames, I suggest using the H2O API to run AutoML and other H2O algorithms. The purpose of Sparkling Water is to integrate Spark and H2O and hide the implementation details. So the algorithms exposed in Sparkling Water/PySparkling are designed to fit well into Spark Pipelines; they follow the Spark API and use Spark DataFrames.

    So I would say it's a matter of choice. Even in PySparkling, you can freely use the H2O Python API and work directly on H2O Frames, or you can use a higher abstraction and go for the Spark wrappers of H2O algorithms. This is an architectural decision made on purpose.

    Thanks for the functionality feedback! Currently, it is not possible to obtain the leaderboard in the Spark API. However, we welcome pull requests from the community; the help is always much appreciated :)

    hugozanini
    @hugozanini

    Hi guys, I'm having a problem and I don't know how to solve it

    I have a big dataset and I would like to use a GBM model to predict one feature. However, I'm not finding examples that explain how to train my h2o model using a Spark dataframe.

    I can do this by converting to a pandas dataframe, but that's impractical with large amounts of data

    Can someone help me? I'm having a lot of difficulties
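A possible route that avoids pandas, sketched under assumptions: the H2OContext converts the Spark DataFrame to an H2OFrame directly, and a regular h2o estimator is trained on that frame. `train_on_spark_frame` is a hypothetical helper; `as_h2o_frame` is the conversion method name in the PySparkling releases of this period, and newer releases may name it differently.

```python
def train_on_spark_frame(hc, spark_df, label_col, estimator):
    """Convert a Spark DataFrame to an H2OFrame and train `estimator` on it.

    hc        -- an H2OContext (assumed to expose as_h2o_frame)
    spark_df  -- the Spark DataFrame holding the features and the label
    label_col -- name of the column to predict
    estimator -- an untrained h2o estimator instance
    """
    # Conversion is done by the H2OContext, without going through pandas.
    h2o_frame = hc.as_h2o_frame(spark_df)
    # Use every column except the label as a feature.
    feature_cols = [c for c in h2o_frame.columns if c != label_col]
    estimator.train(x=feature_cols, y=label_col, training_frame=h2o_frame)
    return estimator
```

With a running cluster this would be called roughly as `train_on_spark_frame(H2OContext.getOrCreate(spark), df, "target", H2OGradientBoostingEstimator())`, where `df` and `"target"` are placeholders for your own data.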

    Jaroslaw Nowosad
    @yarenty
    Hi all. Not sure if this is the correct channel, but… can I ask what is going on with SteamAI? I was presenting H2O with all its cool features (SW, Driverless) and wanted to show my audience that there is full “thinking” behind the productionization of ML - a model repository with model serving … but suddenly it is not on the h2o.ai website anymore.
    Jaroslaw Nowosad
    @yarenty
    Second question about SteamAI: if not Steam AI - what do you suggest -> PredictionIO? Model DB(mit)?
    Erin LeDell
    @ledell
    @yarenty it has been retired (we are no longer working on it), but the code is still available on github.
    NavyathaVobugari
    @NavyathaVobugari
    Hi all, does anyone have an idea how to deploy Sparkling Water on Kubernetes?
    Cate Tucket
    @CatePodLove_twitter

    Hello all, I keep running into this error: AttributeError: 'H2OConf' object has no attribute 'getAll'

    However, it doesn't happen consistently. Sometimes the code runs fine, other times the same code produces this error. A quick search of the documentation seems to indicate that this bug was already resolved, so I'm wondering what my next step should be

    Cheng WeI
    @valkyrias_gitlab
    Hi all, I am working with pysparkling on a Databricks cluster with a fixed number of workers. Sometimes when training a model I get the error "java.lang.ArrayIndexOutOfBoundsException: 65535", but not always. Does anyone know the reason? Thanks
    Also, does anyone know a way to access H2O Flow for a cluster running on Azure Databricks?
    ArunLakhotia
    @ArunLakhotia
    I am using Spark and H2O (Sparkling Water) in my project. I have a line of code which converts a pandas dataframe to a Spark dataframe. Intermittently, it gives the error described in the Stack Overflow link. The error appears after the h2o cluster has run for some days. There was no change in the port of the Java server. https://stackoverflow.com/questions/56019257/converting-a-pandas-dataframe-to-a-spark-dataframe-gives-py4j-network-error
    foxale
    @foxale
    Hi, I loved your H2O World conference held last November in London. Since then, I've been using H2O for almost every project at work. However, I tried to implement a custom MAPE metric for GBM according to this tutorial:
    https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/custom_metric_func/CustomMetricFuncRegression.ipynb
    and it's not working correctly when combined with Grid Search and k-fold cross-validation. I always get the same error, on the last grid parameter combination:
    h2o java.util.NoSuchElementException
    Michal Kurka
    @michalkurka
    @foxale I am glad you enjoyed H2O World and that you are using H2O for your projects! There were a bunch of fixes in the custom metric functionality recently, but your error looks different. Can you share more details please? H2O version, Python version... are you just running the tutorial or did you make modifications?
    razou
    @razou
    Hi @all, is there a Scala API for H2O Sparkling Water? I'm using H2O with Spark to implement XGBoost in Scala and I wanted to know the setters/getters for this model. Thanks
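    Michal Kurka
    @michalkurka
    @razou yep, the Scala API for XGBoost is exposed: https://github.com/h2oai/sparkling-water/blob/master/ml/src/main/scala/org/apache/spark/ml/h2o/algos/H2OXGBoost.scala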
    razou
    @razou
    Thanks @michalkurka
    Does anybody know if sparkling-water-ml_2.11 works with Scala 2.12.x?
    Michal Kurka
    @michalkurka
    @jakubhava would know
    razou
    @razou

    @razou yep, the Scala API for XGBoost is exposed: https://github.com/h2oai/sparkling-water/blob/master/ml/src/main/scala/org/apache/spark/ml/h2o/algos/H2OXGBoost.scala

    What are the differences between ntrees vs nEstimators and learnRate vs eta? And how do I specify that we are using unbalanced data?

    Michal Kurka
    @michalkurka

    they are the same thing; you can choose either

    ntrees / learn_rate will be familiar to H2O GBM’s users
    n_estimators / eta will be familiar to XGBoost users

    regarding handling of unbalanced data: you can use max_delta_step as described on https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html#handle-imbalanced-dataset
    it doesn’t seem like we currently support scale_pos_weight - please feel free to make a feature request!
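As a complement to `max_delta_step`, a different and commonly used technique is inverse-frequency class weighting: each row gets a weight so that every class contributes equally in total. The sketch below only computes those weights in plain Python. `class_weights` is a hypothetical helper, and whether your estimator accepts a per-row weights column is an assumption to check against its documentation.

```python
from collections import Counter


def class_weights(labels):
    """Return one weight per class so that each class sums to the same total.

    weight(c) = n_rows / (n_classes * count(c))
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Toy example: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = class_weights(labels)
# Negative-to-positive ratio, the quantity XGBoost's own docs suggest
# for scale_pos_weight (not supported here, per the message above).
neg_pos_ratio = labels.count(0) / labels.count(1)
```

The per-class weights can be mapped onto a new column of the training data and referenced as the weights column, if the estimator in use supports one.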
    Cheng WeI
    @valkyrias_gitlab
    hi, does anyone know how to install pysparkling via a jar file?
    i don't know where to find it. And I only have this h2o_pysparkling_2.4-2.4.12.tar.gz
    Michal Kurka
    @michalkurka

    @valkyrias_gitlab you can install from the source package using pip:

    pip install h2o_pysparkling_2.4-2.4.12.tar.gz

    Cheng WeI
    @valkyrias_gitlab
    pip does not work very reliably on Databricks... I guess I will keep using PyPI if we don't have a jar
    Michal Kurka
    @michalkurka
    oh okay
    not sure if I can help with this particular use case - perhaps @jakubhava ?
    razou
    @razou
    @michalkurka Thanks for your answers
    Jakub Háva
    @jakubhava
    @valkyrias_gitlab PySparkling is a Python package; there is no JAR file for it. (It internally contains the Sparkling Water assembly jar, but that is a different thing.)
    razou
    @razou
    Hi @here, is there any PySparkling API reference where I can find the complete list of H2OConf properties (setters), e.g. set_external_cluster_mode, use_manual_cluster_start, ...? Thanks
    razou
    @razou
    This solved my problem
    http://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/configuration/configuration_properties.html#external-backend-configuration-properties
    from pysparkling import *
    h2oConf = H2OConf(spark).set("spark.ui.enabled", "true").set(...)
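When many properties are involved, the chained `.set(...)` calls above can be driven from a dictionary. This is a minimal sketch that assumes only what the snippet shows, namely that the conf object has a `.set(key, value)` method returning the conf. `apply_properties` and `to_conf_value` are hypothetical helpers, not part of PySparkling.

```python
def to_conf_value(value):
    """Render a Python value as the string form used in Spark-style configs."""
    if isinstance(value, bool):
        # Spark expects lowercase "true"/"false", not Python's "True"/"False".
        return "true" if value else "false"
    return str(value)


def apply_properties(conf, properties):
    """Apply each key/value pair to `conf` through its chained .set(key, value)."""
    for key, value in properties.items():
        conf = conf.set(key, to_conf_value(value))
    return conf
```

Usage would look like `h2oConf = apply_properties(H2OConf(spark), {"spark.ui.enabled": True})`, with the property names taken from the configuration page linked above.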
    razou
    @razou

    Hi @here, how can I export / save a pysparkling model to an external store, e.g. Amazon S3 or HDFS?

    from pysparkling.ml import H2OGBM
    gbm = H2OGBM(labelCol=target, featuresCols=xCols, maxDepth=8)
    gbmModel = gbm.fit(trainingData)

    I wanted to know how to export gbmModel to Amazon S3 (in MOJO format, for example)

    Thanks

    razou
    @razou
    If I was only using H2O without Spark the solution would be gbmModel.download_mojo(path="~/models/h2o/", get_genmodel_jar=True)
    razou
    @razou
    Hi again
    I wanted to know how to create a MOJO model/file from pysparkling.ml models (not the h2o.estimators ones). I haven't found any documentation explaining this yet.
    Thanks in advance !
    razou
    @razou
    Has anybody experienced this issue when creating H2OContext:
    spark = SparkSession.builder.appName("SparkApp").getOrCreate()
    h2oConf = H2OConf(spark).set("spark.ui.enabled", "true")
    hc = H2OContext.getOrCreate(spark, conf = h2oConf)
    An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
    : java.lang.RuntimeException: Cloud size under 24
        at water.H2O.waitForCloudSize(H2O.java:1827)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.org$apache$spark$h2o$backends$internal$InternalH2OBackend$$startH2OCluster(InternalH2OBackend.scala:102)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:74)
        at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:129)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:401)
        at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
        at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:257)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/pysparkling/context.py", line 161, in getOrCreate
        jhc = jvm.org.apache.spark.h2o.JavaH2OContext.getOrCreate(jspark_session, selected_conf._jconf)
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
        format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
    : java.lang.RuntimeException: Cloud size under 24
        at water.H2O.waitForCloudSize(H2O.java:1827)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.org$apache$spark$h2o$backends$internal$InternalH2OBackend$$startH2OCluster(InternalH2OBackend.scala:102)
        at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:74)
        at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:129)
        at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:401)
        at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
        at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:257)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
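The message "Cloud size under 24" appears to mean that fewer than the expected 24 H2O nodes joined the cluster before the wait gave up. One plausible cause in internal backend mode is an executor count that differs from what H2O expects, for example when dynamic allocation adds or removes executors. The sketch below is a plain-Python sanity check over a dictionary of Spark settings. `internal_backend_warnings` is a hypothetical helper and the diagnosis itself is an assumption, not a confirmed fix.

```python
def internal_backend_warnings(spark_settings, expected_nodes):
    """Return human-readable warnings about settings that can change the
    number of executors H2O sees in internal backend mode."""
    warnings = []
    dynamic = str(spark_settings.get("spark.dynamicAllocation.enabled", "false"))
    if dynamic.lower() == "true":
        warnings.append("dynamic allocation is enabled; executor count may change")
    instances = spark_settings.get("spark.executor.instances")
    if instances is None:
        warnings.append("spark.executor.instances is not set")
    elif int(instances) != expected_nodes:
        warnings.append(
            "spark.executor.instances is %s but %d nodes are expected"
            % (instances, expected_nodes)
        )
    return warnings
```

On a live session the settings dictionary could be obtained with `dict(spark.sparkContext.getConf().getAll())` and checked with `expected_nodes=24` before creating the H2OContext.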
    Cheng WeI
    @valkyrias_gitlab
    Hi all, I am using SW 2.4.13 to train a model and it stalled at one point. I checked the log, which says:
    WARN: Exceeded waiting interval of 8040 seconds for a task of type 'Boosting Iteration (tid=301)' to finish on node '10.139.64.30/10.139.64.30:54321'.
    Node 10.139.64.30 seems to be the master node in this case. Do I need to wait a bit longer?
    I used the same data before with 600 features and it worked. Now I am using 200 of them and it's giving me this problem. Not sure how I can fix it
    Michal Kurka
    @michalkurka
    @valkyrias_gitlab I answered in the h2o-3 channel