Michal Kurka
@michalkurka
GLM & Deep Learning will by default use mean imputation; the rest of the algos handle NAs natively. The metalearner in SE should never see NAs because it acts on the predictions of the submodels, and by default they all handle NAs (either impute or handle natively). Do you have any specific issue?
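As a rough illustration of what mean imputation means here: each missing value in a numeric column is replaced by the mean of the observed values in that column. This is a minimal plain-Python sketch, not H2O's internal implementation, and the function name `mean_impute` is hypothetical.

```python
from statistics import mean


def mean_impute(column):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in column if v is not None]
    if not observed:
        # Nothing to compute a mean from; return the column unchanged.
        return list(column)
    fill = mean(observed)
    return [fill if v is None else v for v in column]


print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```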
Juan C Rodriguez
@jcrodriguez1989
I've seen some algorithms perform better if NAs are not faked (imputed). So for my specific case (using Ensemble), I am not going to impute values; I will let GLM & DL impute internally.
thanks @michalkurka !
schwannden
@schwannden
In h2o, is any standardization applied to the target (response) while training a model? I am aware that the documentation says features get standardized in certain algorithms, but what about the target?
ssadiq07
@ssadiq07
With the new h2o.train_segments functionality, is anything being worked on to score these models in parallel? I am attempting to develop thousands of models, and looping through the model objects to score new data will be time-consuming.
Juan C Rodriguez
@jcrodriguez1989
Hi everyone,
Maybe this project could be helpful for someone else: https://github.com/jcrodriguez1989/pocaAgua .
It gives a really easy way to make a PoC with a UI. It is really useful for me when working with clients.
Of course it is using AutoML magic.
Jordan Bentley
@jbentleyEG
@jakubhava I am trying to upgrade to the latest version of Sparkling Water and RSparkling, but I am having trouble working with the H2OContext replacing h2o_context
I used to be able to pass the h2o_context I had constructed in R into Scala, where a large chunk of my ML lives (with the intention of wrapping it in pyspark in the future)
> h2o <- rsparkling::H2OContext(spark)
> sparklyr::invoke_static(spark, "com.apptegic.datascience.corvus.CorvusContext",
+                         "setH2O", h2o)
Error in stop("Unsupported type '", type, "' for serialization") : 
  object 'type' not found
I'm looking at git and it looks like you are about to deprecate org.apache.spark.h2o.H2OContext in favor of ai.h2o.sparkling.H2OContext. Will that possibly fix the issue when it is released?
I don't think it's a documented feature, but maybe there could be a test to ensure that the H2OContext can be passed into Scala?
I can add a JIRA if you would like
Jordan Bentley
@jbentleyEG
also, with all of the redesign that appears to be going on around the H2OContext for SW, is there any chance that it will translate into support for dynamic allocation?
Jakub Háva
@jakubhava
@jbentleyEG please see https://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/rsparkling.html on how to use the latest RSparkling
Jordan Bentley
@jbentleyEG
ok, thanks, I'll give it a try
Jordan Bentley
@jbentleyEG
@jakubhava
> h2oConf <- rsparkling::H2OConf()
> h2o <- rsparkling::H2OContext.getOrCreate(h2oConf)
> sparklyr::invoke_static(spark, "com.apptegic.datascience.corvus.CorvusContext", "setH2O", h2o)
Error in stop("Unsupported type '", type, "' for serialization") : 
  object 'type' not found
still no luck
The method being called:
  def setH2O(h2oContext: H2OContext) = {
    this.h2oContext = h2oContext
  }
Jordan Bentley
@jbentleyEG
This is the sparklyr method that is failing:
writeObject <- function(con, object, writeType = TRUE) {
  type <- class(object)[[1]]

  if (type %in% c("integer", "character", "logical", "double", "numeric", "factor", "Date", "POSIXct")) {
    if (is.na(object)) {
      object <- NULL
      type <- "NULL"
    }
  }

  serdeType <- getSerdeType(object)
  if (writeType) {
    writeType(con, serdeType)
  }
  switch(serdeType,
         NULL = writeVoid(con),
         integer = writeInt(con, object),
         character = writeString(con, object),
         logical = writeBoolean(con, object),
         double = writeDouble(con, object),
         numeric = writeDouble(con, object),
         raw = writeRaw(con, object),
         array = writeArray(con, object),
         list = writeList(con, object),
         struct = writeList(con, object),
         spark_jobj = writeJobj(con, object),
         environment = writeEnv(con, object),
         Date = writeDate(con, object),
         POSIXlt = writeTime(con, object),
         POSIXct = writeTime(con, object),
         factor = writeFactor(con, object),
         `data.frame` = writeList(con, object),
         stop("Unsupported type '", type, "' for serialization"))
}
For whatever reason it is no longer recognizing h2o as a spark_jobj
Jordan Bentley
@jbentleyEG
ok, I think I figured it out
the jobj is now wrapped in the H2OContext, so I was able to get it with h2o$jhc and send it into scala
Juan C Rodriguez
@jcrodriguez1989
Hi, is it possible to give my own sort_metric function for AutoML?
If not, which one do you think could be similar to MAPE?
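For reference, MAPE (mean absolute percentage error) is conventionally the mean of |actual - predicted| / |actual|, expressed as a percentage. Below is a minimal plain-Python sketch for comparing candidate models outside of AutoML. The function name `mape` is hypothetical, and skipping rows whose actual value is zero is an assumption made here because the ratio is undefined for them.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    # Assumption: rows with a zero actual value are skipped (ratio undefined).
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


print(mape([100, 200], [110, 180]))
```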
Jordan Bentley
@jbentleyEG
@jakubhava for sparklyr, if I switch to an external backend instead of internal, could I expect more stability?
and can it generally use the same resources (cpus) as a cluster on the same machine as long as spark isn't doing any heavy lifting at the same time as H2O?
Jakub Háva
@jakubhava
@jbentleyEG Regarding backends, you should find the answer here; if not, please let us know: https://docs.h2o.ai/sparkling-water/2.4/latest-stable/doc/deployment/backends.html
btw: Can I suggest using the sparkling-water channel for Sparkling Water related questions?
Simon Schmid
@SimonSchmid
Hello all,
when will version 3.30.0.4 of h2o-scala be available? I cannot find the jar to download. See https://mvnrepository.com/artifact/ai.h2o/h2o-scala
Michal Kurka
@michalkurka

@SimonSchmid we are in the process of deprecating this module. For your project you can either build the artifact from the h2o-3 sources or just keep using the 3.30.0.3 artifact (there was no change).

h2o-scala was primarily meant for the Sparkling Water project, which no longer relies on it

What is your use case for h2o-scala?

Simon Schmid
@SimonSchmid
I was actually using it for sparkling water and was about to update to the latest version. Seems like I don't need the dependency anymore in this case. Thanks!
Michal Kurka
@michalkurka
yeah, for Sparkling Water you don’t need it
razou
@razou
Hello,
I wanted a way to perform zero-padding on a string column (with numbers like 1, 2, 43, …)
But I am unable to do it with apply:
df['x'].ascharacter().apply(lambda x: x.zfill(2), axis=1).head(10)
ValueError: Unimplemented: op <zfill> not bound in H2OFrame
Any ideas?
Thanks
The apply function seems to work only with a limited number of operations
Michal Kurka
@michalkurka
you are correct, apply only works for certain built-in functions; using an arbitrary function is not supported
razou
@razou
Thanks @michalkurka for your answer
Do you have any idea how one could do such operations (i.e. applying an arbitrary function to each row of a given frame)?
Michal Kurka
@michalkurka

you can convert the column to Pandas, apply your transformation, then convert back to an H2OFrame and cbind with the original frame

this is cumbersome but H2O currently doesn’t have the functionality to execute any Python code on the backend
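The round trip described above could look roughly like the sketch below. It assumes a running H2O cluster, that `frame` is an h2o.H2OFrame, and that the column has no missing values. The helper names and the `_padded` suffix are hypothetical. Only the single column is pulled to the client, which limits (but does not remove) the pandas overhead.

```python
def zero_pad(values, width):
    """Left-pad each value with zeros to `width` characters (plain Python)."""
    return [str(v).zfill(width) for v in values]


def zero_pad_h2o_column(frame, column, width):
    """Round-trip one H2OFrame column through pandas, then cbind it back.

    Assumes a running H2O cluster and that `frame` is an h2o.H2OFrame.
    """
    import h2o  # imported lazily so zero_pad() works without H2O installed

    # Pull only the one column to the client to limit the data transferred.
    pdf = frame[column].as_data_frame(use_pandas=True)
    new_name = column + "_padded"  # hypothetical name for the new column
    pdf[new_name] = zero_pad(pdf[column], width)
    # Force the string type so "01" is not parsed back into the number 1.
    padded = h2o.H2OFrame(pdf[[new_name]], column_types={new_name: "string"})
    return frame.cbind(padded)


print(zero_pad([1, 2, 43], 2))  # ['01', '02', '43']
```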

razou
@razou

you can convert the column to Pandas, apply your transformation, then convert back to an H2OFrame and cbind with the original frame

this is cumbersome but H2O currently doesn’t have the functionality to execute any Python code on the backend

That’s what I have done, but I wanted to avoid that conversion because working with pandas on large datasets is very slow, unlike H2O frames

Michal Kurka
@michalkurka
right, if your use case involves specific data munging that we don’t support you might want to consider using Sparkling Water - do the munging in Spark and ML in H2O
razou
@razou
That would be the best solution. Thanks
Kriszhou1
@Kriszhou1
I don't know how to make h2o.ai accessible from an IFrame. Can anyone help me solve this problem?
John
@wafflejohnh_twitter

Hi all, when running AutoML,
h2o.automl.get_leaderboard(aml, seems to return a leaderboard with training metrics instead of the xval metrics described in the docs. Is this a known issue? I haven't been able to find what's causing this. I'd appreciate any pointers from the fellow experts (Y) Thanks!

autoML run is set up as such:
aml = H2OAutoML(max_models=20, seed=21, nfolds=5, sort_metric='rmse',
                keep_cross_validation_predictions=True,
                keep_cross_validation_models=False,
                project_name=project_name)
aml.train(y=target, training_frame=hf_train)

Erin LeDell
@ledell
@wafflejohnh_twitter can you provide a reproducible example please? they should be the CV metrics...
razou
@razou

Hi,
I’m getting strange behavior when reading data (a folder containing many csv files) from Amazon S3 with import_file

  • I generate my data with Spark + Scala and write it in csv format to an S3 folder
  • When loading that data with h2o.import_file(path=s3_path, pattern=".*\.csv") I’m getting some data from column 28 in column 1, some data from column 29 in column 2, …

Does this function have a limit on the number of columns, did I miss an additional option, or is it due to something else?

Thanks

John
@wafflejohnh_twitter
@ledell I reran the AutoML call in a new kernel and it is showing xval in the leaderboard now. Thanks for confirming. Could it be that if a previous AutoML run disables xval, the leaderboard will only show training scores for subsequent models, whether or not they were generated with xval?
razou
@razou
It seems that the na_omit() function removes rows containing at least one NA entry. Is there an option to remove only rows where all entries are NA?
Thanks
akshayi1
@akshayi1
Are there any plans to introduce walk-forward validation for timeseries predictions?
razou
@razou

Hi
I wanted to perform frequency encoding on the categorical features in a given frame
Suppose that I have a dictionary where keys are categorical columns and values are their frequencies. How can one replace each category with its frequency on the entire frame?

Example with a pandas data frame

fe = h2oDF_fe.groupby(col).size() / len(h2oDF_fe)
h2oDF_fe.loc[:, col_fe] = h2oDF_fe[col].map(fe)

Is there an equivalent of the pandas map function for h2o frames?
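One possible way to get the same effect without a map function is to compute the per-category counts with group_by and join them back with merge. The sketch below is an assumption-laden illustration, not a confirmed recommendation from this channel. It assumes `frame` is an h2o.H2OFrame on a running cluster, that the count column produced by group_by is named "nrow", and the helper names and `_freq` suffix are hypothetical. The plain-Python `frequency_map` shows the quantity being computed. Note that merge may not preserve the original row order.

```python
from collections import Counter


def frequency_map(values):
    """Map each category to its relative frequency (count / total rows)."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def frequency_encode_h2o(frame, column):
    """Frequency-encode `column` of an H2OFrame via group_by + merge.

    Assumes `frame` is an h2o.H2OFrame on a running cluster.
    """
    # One row per category, with its row count in a column named "nrow".
    counts = frame.group_by(column).count().get_frame()
    freq_name = column + "_freq"  # hypothetical name for the encoded column
    counts[freq_name] = counts["nrow"] / frame.nrow
    # Left join on the shared column puts the frequency on every row.
    return frame.merge(counts[[column, freq_name]], all_x=True)


print(frequency_map(["a", "b", "a", "a"]))  # {'a': 0.75, 'b': 0.25}
```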

Juan C Rodriguez
@jcrodriguez1989
Is there a way to disable GPU backend for AutoML?
On my ubuntu laptop, when xgboost is included in AutoML, it is crashing.
If I run just xgboost(..., backend = "auto") it is also crashing.
However, xgboost(..., backend = "cpu") works.
I've tried with several h2o versions.
I could provide a simple reprex and session info.
razou
@razou
Hi,
For xgboost(), only one_hot_encoding seems to be used whatever encoding is specified (e.g. even with 'categorical_encoding': 'enum', the model uses one-hot encoding). Is that correct?
Chrinide
@chrinide
Is there any possibility of implementing Rotation Forest in H2O? Given H2O's great Random Forest and Principal Component Analysis algorithms, it seems it would be very easy to get a Rotation Forest. A report on arXiv, https://arxiv.org/abs/1809.06705, shows that Rotation Forest is the best classifier for problems with continuous features.
razou
@razou

Hi
With a dictionary (imputer_dict) whose keys are the same as my frame's columns, I wanted to fill the NAs in a frame with that dictionary's values, as is possible with pandas dataframes (e.g. DataFrame.fillna(value=imputer_dict))

For each column (x) in my frame, I wanted to replace its NAs with its value in imputer_dict (i.e. imputer_dict[x])

If anybody has an idea, please let me know

Thanks
