schwannden
@schwannden
are we simply using the following three methods?
for binary classification: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html#logistic-regression-binomial-family
for regression: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html#linear-regression-gaussian-family
for multiclass classification, are we using something similar to binary classification's deviance (entropy)?
DarkBlaez
@DarkBlaez
Hi all, I am struggling with this a bit. What is the best way to get sparkling-water injected into an existing Spark cluster? I have a cluster I spun up to test across 3 VMs, so master + 2 workers. What is the mechanism I need? Do I just inject this as a package?
I am running spark 2.4.5 and have downloaded the latest sparkling-water
I will be accessing via Jupyter Notebook
Jakub Háva
@jakubhava
@DarkBlaez Please have a look at the Sparkling Water documentation: https://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/install/install_and_start.html
pbhat14
@pbhat14

As per the release notes (https://www.h2o.ai/blog/h2o-release-3-26-yau/), SHAP values can be retrieved from a MOJO as well. However, there is no function such as h2o.mojo_predict_contributions or an equivalent?

Once the model is imported with gbm_m = h2o.import_mojo('GBM_model_R.zip') and h2o.predict_contributions(gbm_m, data) is run (note: data is already an H2O data frame and an H2O cluster is active), below is the output:
(screenshot of the output)

There is another link (http://docs.h2o.ai/sparkling-water/2.2/latest-stable/doc/tutorials/shap_values.html) which doesn't give clear guidance on how to retrieve SHAP values outside of the Sparkling Water version of H2O. How can we extract SHAP values from a MOJO object directly, without the need to spin up a cluster, i.e. via functions such as h2o.mojo_predict_df?

But I am not sure where to do it, or whether it'll be specific to my locally run package. My end goal is using the MOJO to extract SHAP values in a web API. How do I achieve this?
Honza Sterba
@honzasterba
@pbhat14 that is exactly what MOJO is meant for: you can embed the MOJO into your web app written in Java and extract the SHAP values. Here is an example: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html#step-2-compile-and-run-the-mojo
pbhat14
@pbhat14
Hi, sorry. My web API uses plumber. It uses R code, and hence I need an R command such as h2o.mojo_predict_df to incorporate in my code. Are there any such commands which help me do that for SHAP values, i.e. an h2o.mojo_predict_contributions_df?
Pavel Pscheidl
@Pscheidl

@pbhat14 This problem has two dimensions. MOJO itself is able to give you SHAP contributions. Just enable the contributions by calling setEnableContributions(true) on EasyPredictModelWrapper. This is demonstrated in the link @honzasterba gave you above (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html#step-2-compile-and-run-the-mojo).
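To make that concrete, here is a minimal Java sketch of the approach (the model file name and feature name are placeholders, and this assumes the h2o-genmodel library is on the classpath; see the linked example for the full setup):

```java
import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

public class ShapFromMojo {
    public static void main(String[] args) throws Exception {
        // Load the MOJO and enable SHAP contributions before wrapping it
        EasyPredictModelWrapper.Config config = new EasyPredictModelWrapper.Config()
            .setModel(MojoModel.load("GBM_model_R.zip"))   // placeholder path
            .setEnableContributions(true);
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(config);

        RowData row = new RowData();
        row.put("feature1", "42");                          // placeholder feature

        BinomialModelPrediction p = (BinomialModelPrediction) model.predict(row);
        // p.contributions holds one SHAP value per feature plus the bias term
        for (float c : p.contributions) {
            System.out.println(c);
        }
    }
}
```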

MOJO import is a feature that imports a MOJO back into H2O, with limited functionality (it's not a native/binary model). Scoring works, as does inspection of model stats. However, contributions are not available yet. There is a JIRA for that: https://0xdata.atlassian.net/browse/PUBDEV-7466 (please upvote if you feel like it)

Michal Kurka
@michalkurka

Hi, sorry. My web API uses plumber. It uses R code, and hence I need an R command such as h2o.mojo_predict_df to incorporate in my code. Are there any such commands which help me do that for SHAP values, i.e. an h2o.mojo_predict_contributions_df?

hmm, I think PUBDEV-7466 won't help you (cc: @Pscheidl), since it still needs H2O running. And if you have H2O running, you can ignore the MOJO and work with a binary model to get the Shapley values.

The best option for you would really be a function like mojo_predict_contributions_df. Please feel free to file a feature request in our JIRA; IMHO this would be a good addition with moderate-to-low complexity.

pbhat14
@pbhat14
Noted, I will file a feature request. I also wanted to ask for general advice on deployment into production: is it advisable (preferable or suggested) to use the MOJO in R with functions such as mojo_predict, or would initiating an H2O cluster, importing the model, and then predicting results with functions such as h2o.predict in a plumber API also be a stable option?
Michal Kurka
@michalkurka
that really depends on what kind of data you want to score: is it batch scoring or line-by-line?
Michal Kurka
@michalkurka
give us a bit more information about your use case and I think we will be able to find a suitable solution for you
pbhat14
@pbhat14
It's line-by-line, I mean one row of prediction in an operations environment. One row of data will be sent to the web API, and it is expected to return the probability in < 1 second. There are some transformations and checks in the API as well, which are part of the R script apart from the MOJO predict. Can you provide general advice, as we aren't sure which is the most robust approach while many may be technically possible?
Michal Kurka
@michalkurka
I think in that case it is better to have an H2O instance running and make requests to the instance. The advantage is that the model stays loaded in memory; with mojo_predict_df you would need to load/parse the model every single time. This might be fine for smaller models, but as your model gets more complicated it might become an issue
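The trade-off can be sketched generically (this is not H2O code; loadModel below is a made-up stand-in for the expensive unzip-and-parse step of a MOJO, and the weights are dummy values):

```java
import java.util.HashMap;
import java.util.Map;

public class ModelCache {
    private static Map<String, double[]> cached;   // survives across requests

    // Stand-in for the expensive part: unzipping and parsing the MOJO
    static Map<String, double[]> loadModel(String path) {
        Map<String, double[]> m = new HashMap<>();
        m.put("weights", new double[]{0.5, 0.5});  // dummy model parameters
        return m;
    }

    // mojo_predict_df style: pay the load cost on every single call
    static double predictReload(String path, double[] row) {
        Map<String, double[]> m = loadModel(path);
        return dot(m.get("weights"), row);
    }

    // resident-instance style: load once, reuse on all later calls
    static double predictCached(String path, double[] row) {
        if (cached == null) {
            cached = loadModel(path);
        }
        return dot(cached.get("weights"), row);
    }

    private static double dot(double[] w, double[] x) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += w[i] * x[i];
        return s;
    }
}
```

Both paths return the same prediction; the difference is only where the load cost is paid, which is what matters for a < 1 second per-request budget.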
pbhat14
@pbhat14
But an H2O instance in an R session also uses an API to connect with H2O, so it's an API within an API. Also, from my initial testing it is slower at run time, because converting data with as.h2o takes time, which isn't required with the MOJO. I don't really load the model: my MOJO is stored on the server, and all it does is read it directly via MOJO predict. I don't use import_model and load the MOJO. I would only do that if I was using the H2O instance approach.
Juan C Rodriguez
@jcrodriguez1989
Hi, I could not find much information on how AutoML is handling missing values when predicting.
As far as I could research, it seems that this is being handled by the leader model (which of course could be an Ensemble).
So, what would be a recommended approach to follow when predicting on rows with NA columns?
Just to mention, these are not straight-forward imputed columns.
Seb
@sebhrusen
Hi @jcrodriguez1989, in AutoML, missing values are handled in a specific way by each algorithm: see the FAQ section for each algo used by AutoML (DeepLearning, DRF, GBM, GLM, XGBoost, StackedEnsemble) to get an explanation of how each of them handles missing values: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html#supervised
H2OAutoML does not currently add anything on top of that to handle them.
Juan C Rodriguez
@jcrodriguez1989
Hi @sebhrusen , thanks, I really appreciate your time :)
I guess there are no general recommendations, then. Anyway, this is a very problem-specific issue.
Michal Kurka
@michalkurka
GLM & Deep Learning will by default use mean imputation; the rest of the algos handle NAs natively. The metalearner in SE should never see NAs because it acts on the predictions of the submodels, and they all handle NAs by default (either impute or handle natively). Do you have any specific issue?
Juan C Rodriguez
@jcrodriguez1989
I've seen some algorithms perform better if NAs are not faked (imputed). So for my specific case (using an Ensemble), I am not going to impute values; I will let GLM & DL impute internally.
thanks @michalkurka !
schwannden
@schwannden
In h2o, is any standardization done on the target (response) in the process of training a model? I am aware the documentation says features get standardized in certain algorithms, but what about the target?
ssadiq07
@ssadiq07
With the new h2o.train_segments functionality, is anything being worked on to score these models in parallel? I am attempting to develop thousands of models, and looping through the model objects to score new data will be time-consuming.
Juan C Rodriguez
@jcrodriguez1989
Hi everyone,
Maybe this project could be helpful for someone else: https://github.com/jcrodriguez1989/pocaAgua .
It gives a really easy way to make a PoC with a UI. It is really useful for me when working with clients.
Of course it is using AutoML magic.
Jordan Bentley
@jbentleyEG
@jakubhava I am trying to upgrade to the latest version of Sparkling Water and RSparkling, but I am having trouble working with the H2OContext replacing h2o_context
I used to be able to pass the h2o_context I had constructed in R into Scala, where a large chunk of my ML lives (with the intention of wrapping it in pyspark in the future)
> h2o <- rsparkling::H2OContext(spark)
> sparklyr::invoke_static(spark, "com.apptegic.datascience.corvus.CorvusContext", "setH2O", h2o)
Error in stop("Unsupported type '", type, "' for serialization") : 
  object 'type' not found
I'm looking at git and it looks like you are about to deprecate org.apache.spark.h2o.H2OContext in favor of ai.h2o.sparkling.H2OContext; will that possibly fix the issue when it is released?
I don't think it's a documented feature, but maybe there could be a test to ensure that the H2OContext can be passed into Scala?
I can add a JIRA if you would like
Jordan Bentley
@jbentleyEG
also, all of the re-design that appears to be going on around the H2OContext for SW, any chance that it will translate into support for dynamic allocation?
Jakub Háva
@jakubhava
@jbentleyEG please see https://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/rsparkling.html on how to use the latest RSparkling
Jordan Bentley
@jbentleyEG
ok, thanks, I'll give it a try
Jordan Bentley
@jbentleyEG
@jakubhava
> h2oConf <- rsparkling::H2OConf()
> h2o <- rsparkling::H2OContext.getOrCreate(h2oConf)
> sparklyr::invoke_static(spark, "com.apptegic.datascience.corvus.CorvusContext", "setH2O", h2o)
Error in stop("Unsupported type '", type, "' for serialization") : 
  object 'type' not found
still no luck
The method being called:
  def setH2O(h2oContext: H2OContext) = {
    this.h2oContext = h2oContext
  }
Jordan Bentley
@jbentleyEG
This is the sparklyr method that is failing:
writeObject <- function(con, object, writeType = TRUE) {
  type <- class(object)[[1]]

  if (type %in% c("integer", "character", "logical", "double", "numeric", "factor", "Date", "POSIXct")) {
    if (is.na(object)) {
      object <- NULL
      type <- "NULL"
    }
  }

  serdeType <- getSerdeType(object)
  if (writeType) {
    writeType(con, serdeType)
  }
  switch(serdeType,
         NULL = writeVoid(con),
         integer = writeInt(con, object),
         character = writeString(con, object),
         logical = writeBoolean(con, object),
         double = writeDouble(con, object),
         numeric = writeDouble(con, object),
         raw = writeRaw(con, object),
         array = writeArray(con, object),
         list = writeList(con, object),
         struct = writeList(con, object),
         spark_jobj = writeJobj(con, object),
         environment = writeEnv(con, object),
         Date = writeDate(con, object),
         POSIXlt = writeTime(con, object),
         POSIXct = writeTime(con, object),
         factor = writeFactor(con, object),
         `data.frame` = writeList(con, object),
         stop("Unsupported type '", type, "' for serialization"))
}
For whatever reason it is no longer recognizing h2o as a spark_jobj
Jordan Bentley
@jbentleyEG
ok, I think I figured it out
the jobj is now wrapped in the H2OContext, so I was able to get it with h2o$jhc and send it into Scala
Juan C Rodriguez
@jcrodriguez1989
Hi, is it possible to give my own sort_metric function for AutoML?
If not, which one do you think could be similar to MAPE?
Jordan Bentley
@jbentleyEG
@jakubhava for sparklyr, if I switch to an external backend instead of internal, could I expect more stability?
And can it generally use the same resources (CPUs) as a cluster on the same machine, as long as Spark isn't doing any heavy lifting at the same time as H2O?
Jakub Háva
@jakubhava
@jbentleyEG Regarding backends, you should find the answer here; if not, please let us know: https://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/deployment/backends.html
btw: Can I suggest using sparkling-water channel for Sparkling Water related questions?
Simon Schmid
@SimonSchmid
Hello all,
when will version 3.30.0.4 of h2o-scala be available? I cannot find the jar to download. See https://mvnrepository.com/artifact/ai.h2o/h2o-scala
Michal Kurka
@michalkurka

@SimonSchmid we are in the process of deprecating this module; for your project you can either build the artifact from the h2o-3 sources or just keep using the 3.30.0.3 artifact (there was no change).

h2o-scala was primarily meant for Sparkling Water project, which no longer relies on it

What is your use case for h2o-scala?

Simon Schmid
@SimonSchmid
I was actually using it for sparkling water and was about to update to the latest version. Seems like I don't need the dependency anymore in this case. Thanks!