hassan hawilo
For example, we have a stacked ensemble that uses DRF, GBM, and XGBoost, and it works fine,
but once we add the deep learning model to them, it gives this error:
hassan hawilo
still same error java.lang.ArrayIndexOutOfBoundsException: Index 1684 out of bounds for length 1684
at hex.genmodel.GenModel.setCats(GenModel.java:707)
at hex.genmodel.GenModel.setInput(GenModel.java:686)
at hex.genmodel.algos.deeplearning.DeeplearningMojoModel.score0(DeeplearningMojoModel.java:70)
at hex.genmodel.algos.deeplearning.DeeplearningMojoModel.score0(DeeplearningMojoModel.java:158)
at hex.genmodel.algos.ensemble.StackedEnsembleMojoModel.score0(StackedEnsembleMojoModel.java:39)
at hex.generic.GenericModel.score0(GenericModel.java:93)
at hex.Model.score0(Model.java:1992)
at hex.Model.score0(Model.java:1959)
at hex.Model$BigScore.score0(Model.java:1903)
at hex.Model$BigScore.map(Model.java:1881)
at water.MRTask.compute2(MRTask.java:675)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1578)
at hex.Model$BigScore$Icer.compute1(Model$BigScore$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1574)
... 5 more
I am still getting the same error. I checked the dataframe passed to the model: it has the same columns as the training dataframe, and none of the data is NaN.
hassan hawilo
I tried an older version of H2O; now the error has changed to this:
java.lang.IllegalArgumentException: Unsupported MOJO model hex.genmodel.algos.deeplearning.DeeplearningMojoModel.
OSError: Job with key $03017f00000132d4ffffffff$_b756f6aab3e7b7d12d531ff7aec345c8 failed with an exception: java.lang.IllegalArgumentException: Unsupported MOJO model hex.genmodel.algos.deeplearning.DeeplearningMojoModel.
java.lang.IllegalArgumentException: Unsupported MOJO model hex.genmodel.algos.deeplearning.DeeplearningMojoModel.
at hex.generic.Generic$MojoDelegatingModelDriver.computeImpl(Generic.java:91)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:222)
at hex.generic.Generic$MojoDelegatingModelDriver.compute2(Generic.java:70)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1443)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Hi all. I used H2o's Isolation Forest algorithm implementation in Python 3 in an AWS cluster environment (not sure which of these details is relevant). FYI, I am a data scientist, not a software engineer, so I am not proficient in Java, which I see a lot of the code is in.

My question is: is there a way to extract/save/see the attributes and split values selected for each of the trees that are trained for the isolation forest? I have scoured the documentation and looked at the code on GitHub without seeing any obvious way to do so. My use case is showing a non-technical audience what these trees look like, since they are skeptical of the "black box" and do not understand which attributes/split values the observations are being isolated by.
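
On the isolation forest question above: the Python client has an `h2o.tree.H2OTree` class that exposes one tree's structure as parallel lists (`features`, `thresholds`, `left_children`, `right_children`). To my knowledge it accepts isolation forest models in recent H2O versions, but treat that as an assumption to check against your version. Below is a minimal sketch that turns such a tree into readable split descriptions. The helper names and the `toy` object with its column names are made up for illustration, and categorical splits (which use levels rather than a threshold) are not handled.

```python
from types import SimpleNamespace


def describe_splits(tree):
    """Return one readable line per split node of a tree-like object.

    `tree` is assumed to expose parallel lists `features`, `thresholds`,
    `left_children` and `right_children`. A node whose feature is None
    is treated as a leaf and skipped.
    """
    lines = []
    for node, feature in enumerate(tree.features):
        if feature is None:
            continue
        lines.append(
            "node {}: split on {} at {} (left -> {}, right -> {})".format(
                node, feature, tree.thresholds[node],
                tree.left_children[node], tree.right_children[node]))
    return lines


def isolation_forest_splits(model, tree_number=0):
    """Fetch one tree of a trained model and describe its splits (needs h2o)."""
    # Imported lazily so describe_splits can be tried without an H2O install.
    from h2o.tree import H2OTree
    return describe_splits(H2OTree(model=model, tree_number=tree_number))


# Stand-in tree with hypothetical values, only to show the output format.
toy = SimpleNamespace(
    features=["amount", None, "age", None, None],
    thresholds=[120.5, None, 33.0, None, None],
    left_children=[1, -1, 3, -1, -1],
    right_children=[2, -1, 4, -1, -1],
)
for line in describe_splits(toy):
    print(line)
```

Looping `tree_number` over the number of trees in the model would give one such listing per tree, which can then be pasted into slides or drawn as a diagram.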


Nitesh yadav
Hi all,
I am working on a project where I want to save the leader of an H2O AutoML run and load it when it's needed.
I tried to do this with joblib, but it didn't work.
Do I need to use H2O's own model-saving functions, or can joblib work?
Please guide me.
3 replies
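
On saving the AutoML leader: the Python model object is only a handle to a model that lives in the H2O JVM, so joblib does not persist the model itself. `h2o.save_model` and `h2o.load_model` are the usual route. A minimal sketch, assuming `aml` is a finished `H2OAutoML` run; the helper names `save_leader` and `load_leader` are hypothetical.

```python
def save_leader(aml, directory):
    """Save the AutoML leader in H2O's binary format and return the saved path."""
    import h2o  # lazy import: requires the h2o package and a running cluster
    return h2o.save_model(model=aml.leader, path=directory, force=True)


def load_leader(saved_path):
    """Load a model previously written by save_leader into the current cluster."""
    import h2o
    return h2o.load_model(saved_path)
```

Typical use would be `path = save_leader(aml, "./models")` in the training script and `model = load_leader(path)` after `h2o.init()` in the scoring script. Binary models generally need to be loaded with the same H2O version that saved them; exporting a MOJO is the more portable option.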
I'm using h2o.import_file() function to load multiple csv files like this:
h2o_frame = h2o.import_file(CSV_PATH, pattern='{0}_[0-9]+.csv$'.format('train'))
But it fails when one of the csv files is empty (having only column names)
Server error water.exceptions.H2OIllegalArgumentException: Error: File type mismatch. Cannot parse files [train_115092601.csv] and [train_202032.csv] of type CSV and CSV as one dataset.
How can I ignore the empty file, force the merge, or otherwise solve this issue?
Thank you
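
One possible workaround for the header-only file problem above is to filter the files on the Python side before handing them to H2O. This is a sketch using only the standard library; the function name `non_empty_csvs` and the sample file names are made up for illustration, and it assumes the files are reachable from the machine running the script.

```python
import csv
import glob
import os
import re
import tempfile


def non_empty_csvs(directory, pattern):
    """Return sorted paths of matching CSV files that have at least one data row."""
    regex = re.compile(pattern)
    keep = []
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        if not regex.search(os.path.basename(path)):
            continue
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)              # skip the header row
            if any(row for row in reader):  # any non-blank row after the header
                keep.append(path)
    return keep


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "train_1.csv": "a,b\n1,2\n",
        "train_2.csv": "a,b\n",       # header only, should be skipped
        "test_1.csv": "a,b\n3,4\n",   # does not match the pattern
    }
    for name, text in samples.items():
        with open(os.path.join(tmp, name), "w", newline="") as handle:
            handle.write(text)
    kept = [os.path.basename(p) for p in non_empty_csvs(tmp, r"train_[0-9]+\.csv$")]
print(kept)
```

The resulting list can then be imported file by file with `h2o.import_file` and combined with `rbind`. Passing the whole list to a single `h2o.import_file` call may also work, but I have not verified that for every version.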
Simon Schmid
Hi all,
I have a question regarding this line: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/hex/Model.java#L1536 Why do we care about this? Isn't the response column ignored for scoring anyway? I encountered the error by chance and was just wondering why the check is there.
2 replies
Frankie Logan
Hi all,
I am working with a tweedie glm model and was wondering why the AIC is showing up as NULL
I noticed that when using H2O, the Python logging module won't print anything. Has anybody experienced this?
import logging

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(name)s %(levelname)s:%(message)s')
logger = logging.getLogger(__name__)

# logging formats lazily, so each extra argument needs a %s placeholder
logger.info('Training size: %s', train.nrow)
logger.info('Validation size: %s', validation.nrow)
6 replies
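
Two things can make the logging calls above appear silent, independent of H2O. First, `logger.info('Training size: ', value)` passes an argument without a `%s` placeholder, so the logging module reports a formatting error instead of emitting the record. Second, `logging.basicConfig` does nothing if the root logger already has handlers; whether another library installed them in this case is an assumption. A minimal sketch of a setup that avoids both, with a `StringIO` sink and the logger name `training` used purely for illustration:

```python
import io
import logging

stream = io.StringIO()  # stand-in sink so the output can be inspected
logging.basicConfig(
    level=logging.DEBUG,
    format="%(name)s %(levelname)s:%(message)s",
    stream=stream,
    force=True,  # Python 3.8+: replace handlers that were installed earlier
)
logger = logging.getLogger("training")

n_train = 1000  # stands in for train.nrow
logger.info("Training size: %s", n_train)
print(stream.getvalue(), end="")
```

In a real script, drop the `stream` argument so records go to stderr as usual.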


I have a frame with some categorical columns (X1, ..., X5), and X3 had only NaN in the initial CSV file.
I see strange behavior when using the types property:

h2o_df['X3'] =  h2o_df['X3'].ascharacter().asfactor()
categorical_cols = [k for (k, v) in h2o_df.types.items() if v == 'enum' and k not in ['y']]

This returns [X1, X2, X3, X4, X5, C1] instead of [X1, X2, X3, X4, X5].
Why was the column C1 added?

Frankie.Logan: There is no reason for glm model not to have AIC. It is an oversight. I am adding it now for you. Here is the JIRA: https://h2oai.atlassian.net/browse/PUBDEV-8065 . You can check it here to see when it is done. Thank you for bringing it to my attention. Wendy
1 reply

Looking for feedback from the h2o community on how they productionize h2o models.

Based on the documentation, it looks like a Java application is the standard way to productionize h2o models.


  1. Should h2o models always be productionized using Java application(s)?
  2. How do people who lack Java skills but have Python skills (like me) productionize h2o models?
  3. I was under the assumption that I could create a Python Flask/gunicorn application that starts an h2o cluster and uses the h2o.upload_mojo() function to load an h2o MOJO model for prediction, or that uses h2o.mojo_predict_csv / h2o.mojo_predict_pandas to get predictions from an h2o model without even starting an h2o cluster. Is this not a good standard?
3 replies
Jay van Zyl
Hello, I have an issue with the rest api call when training a dl model:
Illegal argument for field: hidden of schema: DeepLearningParametersV3: cannot convert ""90"" to type int
Jay van Zyl
Found the solution.
Nitesh yadav
Hello, I'm working on a college project to build my own AutoML.
How does H2O AutoML decide which type of model (classification or regression) to train? How does it tell whether a problem is regression or classification?
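
On the question above: H2O decides from the type of the response column. A categorical (enum/factor) response is treated as classification and a numeric response as regression, which is why numeric class labels usually need `asfactor()` first. For a home-grown AutoML, the same rule could be sketched as below; the function name `infer_problem_type` is hypothetical and the type names follow H2O's column types.

```python
def infer_problem_type(column_type, n_levels=None):
    """Apply the rule: categorical response -> classification, numeric -> regression.

    `column_type` uses H2O's type names ("enum", "int", "real");
    `n_levels` is the number of distinct classes of a categorical response.
    """
    if column_type == "enum":
        return "binomial" if n_levels == 2 else "multinomial"
    if column_type in ("int", "real"):
        return "regression"
    raise ValueError("unsupported response type: {}".format(column_type))


print(infer_problem_type("enum", n_levels=2))
print(infer_problem_type("real"))
```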
Simon Schmid
Hi all, I found a bug in the AutoML class, see https://github.com/h2oai/h2o-3/blob/ff45788d86eda742eb0464d66d938094250b32e8/h2o-automl/src/main/java/ai/h2o/automl/AutoML.java#L93. The synchronization works properly for 2 concurrent calls but not for more than 2. If there are e.g. 3 concurrent calls at 12:00:00, all will have the same startTime. The first one processed will then take 12:00:00 as startTime, and the second call will first wait and then retrieve new start times until it is at least 12:00:01. 12:00:01 will then be saved as lastStartTime, which means that the third call is actually fine with keeping 12:00:00. This will then result in an error, as it produces duplicate model IDs together with the run of the first call. Not sure what the best fix is, probably just checking that startTime is a time after lastStartTime.
4 replies
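
The fix suggested above (make sure every startTime is strictly after lastStartTime) can be sketched in a few lines. This is an illustration in Python rather than the Java of the AutoML class, the class name `StartTimeAllocator` is made up, and it bumps the timestamp by one second instead of waiting, which is a simplification of the behaviour described in the message.

```python
import threading


class StartTimeAllocator:
    """Hand out start times (whole seconds) that are strictly increasing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = None

    def next_start(self, now):
        # The comparison and the update happen under one lock, so a third
        # caller always sees the value stored by the second one.
        with self._lock:
            if self._last is not None and now <= self._last:
                now = self._last + 1
            self._last = now
            return now


alloc = StartTimeAllocator()
results = []
threads = [
    threading.Thread(target=lambda: results.append(alloc.next_start(43200)))
    for _ in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

Three calls that all arrive with the same clock value end up with three distinct start times, so model IDs derived from them no longer collide.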
Simon Schmid
Hi all, it's me again. From time to time, h2o runs a memory benchmark here https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/water/HeartBeatThread.java#L163. What is the purpose of this? On Windows, it leads to a quite high CPU utilization when running. Took me a while to figure out where it comes from, a Thread Dump revealed it. I figured out that I can disable it by setting the system property sys.ai.h2o.heartbeat.benchmark.enabled=false. However, I am wondering what the downside is of disabling it.
2 replies
Hi, all. How to delete an experiment using python client?
1 reply

Hello. I'm having an annoyingly time-consuming issue and was wondering if anyone here has any suggestions.

I'm basically using h2o to train and cross-validate some ANNs on a few different bio-logging data sets (some immersion and some acceleration). It works fine for the immersion datasets (which are all binary and <120MB), completing in just a few hours, but for some reason it hangs at 100% when training on the first acceleration one (these are float and considerably larger, 170MB-8GB).

I suspect it's a memory issue, but with no error message I have no idea how to troubleshoot this and proceed. I've been stuck on this for a week now, coming up to the climax of a master's project! Does anyone have any ideas?

The part where it hangs:

deeplearning Model Build progress: <progress bar> 100%


#!/usr/bin/env python3

import h2o
from h2o.estimators import H2ODeepLearningEstimator
import glob
import re

h2o.init(min_mem_size='30G', max_mem_size="100G")

files = glob.glob('../Data/Reduced/ACC*.csv')

for f in files:

    # Load data
    data = h2o.import_file(f, header=1)
    data['Dive'] = data['Dive'].asfactor()
    data['BirdID'] = data['BirdID'].asfactor()

    # Extract model ID from filepath
    wdw = re.search(r"/ACC(\d+)_reduced", f).group(1)

    # Build, train, and cross-validate model
    dl_cross = H2ODeepLearningEstimator(model_id = 'ACC_window_' + wdw,
                                        distribution = "bernoulli",
                                        hidden = [200, 200],
                                        fold_column = 'BirdID',
                                        keep_cross_validation_models = True,
                                        keep_cross_validation_fold_assignment = True,
                                        keep_cross_validation_predictions = True,
                                        score_each_iteration = True,
                                        epochs = 50,
                                        train_samples_per_iteration = -1,
                                        activation = "RectifierWithDropout",
                                        #input_dropout_ratio = 0.2,
                                        hidden_dropout_ratios = [0.2, 0.2],
                                        single_node_mode = False,
                                        balance_classes = False,
                                        force_load_balance = False,
                                        seed = 23123,
                                        score_training_samples = 0,
                                        score_validation_samples = 0,
                                        stopping_rounds = 0)

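    # Response assumed to be 'Dive' (the binary factor column created above)
    dl_cross.train(x = data.columns[1:-1],
                   y = 'Dive',
                   training_frame = data)

    # Save model
    h2o.save_model(model=dl_cross, path="../Data/Reduced/H2O_ACC_XVal_Models/", force=True)

# Close session
h2o.cluster().shutdown()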
Hey, I am wondering what method the models use to perform multinomial classification. I'm having issues where stacked ensemble trains many times slower on certain multinomial classification problems compared to the other h2o models. Do you have suggestions on which parameters I can change to improve this? Thanks
Tomáš Frýda
Hi @jwolf9, two reasons I can think of:
(1) GLM metalearner not being able to converge quickly - this can be solved by setting metalearner_algorithm to gbm or xgboost, or by fine-tuning the GLM parameters,
(2) H2O is running out of memory - stacked ensemble requires predictions from each base model for each class, which can lead to bigger "level-one" frames than the original dataset (especially if the response has a high cardinality). Handpicking a subset of diverse models can help, as can adding more memory to the H2O/JVM instance. Keep in mind that XGBoost uses native memory, not the memory assigned to the JVM, so there could be issues with XGBoost if you allocate too much memory to the JVM.
1 reply
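
To get a feel for point (2), the size of the level-one frame can be estimated with simple arithmetic: one prediction column per base model per class. The numbers below are hypothetical, and the byte count assumes uncompressed 8-byte doubles, so it is an upper bound rather than what H2O actually allocates.

```python
def level_one_shape(n_rows, n_base_models, n_classes):
    """Rows and prediction columns of a multinomial level-one frame (rough estimate)."""
    return n_rows, n_base_models * n_classes


rows, cols = level_one_shape(n_rows=1_000_000, n_base_models=20, n_classes=50)
gigabytes = rows * cols * 8 / 1e9  # 8 bytes per value, no compression assumed
print(rows, cols, gigabytes)
```

With 20 base models and 50 classes the metalearner sees 1000 columns, which is often far wider than the original data and helps explain both the memory pressure and the slow GLM fit.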
Hello! Is this chat still monitored/active?
6 replies

Hello, I was wondering if subsetting via the result of h2o.which could be supported? I think that workflow is fairly common in base R so that you can identify row-indices once, then re-use those for subsequent subsets. Currently h2o.which returns a frame which can't be used to subset. You can convert that result to a vector but that is very inefficient with big data.

I have a really specific use-case for this that I posted on stackoverflow: https://stackoverflow.com/q/69366490/9244371 but it errors out saying rectangle assignment is unimplemented (if I'm reading that right). Thanks!

1 reply
pranith macha
Hi, Flow UI 3.36 won't let me download Uplift DRF model. Download of POJO and MOJO is disabled. Any suggestions? Thanks!
1 reply
Nicolás Cañibano
Hi, is there any way to enable SSL (encryption in-flight) using XGBoost on a multi-node cluster? Based on the code in H2O this seems not to be possible (ref: https://github.com/h2oai/h2o-3/blob/master/h2o-extensions/xgboost/src/main/java/hex/tree/xgboost/XGBoost.java). Is this related with an algorithm limitation or an H2O limitation? We are comparing and evaluating H2O-XGBoost vs SageMaker-XGBoost and not being able to enable intra-node encryption in-flight on H2O is a blocker to us. On the other hand, SageMaker seems to support encryption in-flight for XGBoost. Thanks in advance!
2 replies
Cristiano Sarmento
I just cloned h2o-3, and I'm trying to run it locally on macOS, but it has been downloading the sample data for more than 2 hours. Is this normal?
2 replies
Cristiano Sarmento
Another question: I'm building h2o for the first time on a Mac with 16GB RAM and a 2.5 GHz dual-core Intel Core i5 processor. Is this setup able to build / run h2o? I'm asking because my build gets stuck at the :h2o-algos:testMultiNode step; it has been there for about 2 h 40 min at only 12% progress. Thanks!
6 replies
Simon Schmid

Hello, I think I found a bug in the GLM implementation when running multiple GLM training processes in parallel. Actually, it is a regression because it used to work fine for me (I wanted to update h2o from to I tried to find the version that introduced the issue and it seems it has been with I have a Java application that allows me to reproduce the issue (I can share it if that helps); in general it trains multiple GLM models in parallel. The important point is that it trains multinomial and binomial models concurrently; the exception does not occur for just binomial or just multinomial models. From time to time, an exception like this is thrown:

Exception in thread "main" java.util.concurrent.ExecutionException: DistributedException from / 'Index 1 out of bounds for length 1', caused by java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at src.main.java.GLMTest.main(GLMTest.java:46)
Caused by: DistributedException from / 'Index 1 out of bounds for length 1', caused by java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    at water.MRTask.getResult(MRTask.java:654)
    at water.MRTask.getResult(MRTask.java:664)
    at water.MRTask.doAll(MRTask.java:524)
    at water.MRTask.doAll(MRTask.java:476)
    at hex.glm.GLM$GLMGradientSolver.getGradient(GLM.java:3367)
    at hex.glm.ComputationState$2.getGradient(ComputationState.java:563)
    at hex.glm.ComputationState$2.getObjective(ComputationState.java:573)
    at hex.optimization.OptimizationUtils$SimpleBacktrackingLS.<init>(OptimizationUtils.java:74)
    at hex.glm.GLM$GLMDriver.fitIRLSM_multinomial(GLM.java:1598)
    at hex.glm.GLM$GLMDriver.fitModel(GLM.java:2028)
    at hex.glm.GLM$GLMDriver.computeSubmodel(GLM.java:2488)
    at hex.glm.GLM$GLMDriver.doCompute(GLM.java:2626)
    at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:2523)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:246)
    at hex.glm.GLM$GLMDriver.compute2(GLM.java:1160)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1637)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    at hex.glm.GLMTask$GLMMultinomialGradientBaseTask.computeGradientMultipliers(GLMTask.java:1036)
    at hex.glm.GLMTask$GLMMultinomialGradientTask.calMultipliersNGradients(GLMTask.java:1137)
    at hex.glm.GLMTask$GLMMultinomialGradientBaseTask.map(GLMTask.java:1072)
    at water.MRTask.compute2(MRTask.java:820)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1640)
    at hex.glm.GLMTask$GLMMultinomialGradientTask$Icer.compute1(GLMTask$GLMMultinomialGradientTask$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1636)
    ... 5 more

Looking at the diff from to (https://github.com/h2oai/h2o-3/compare/jenkins-, there have been changes in the GLM class.

5 replies
Cristiano Sarmento
I have a question about modifying the h2o-flow project. I made some changes to it and now I want to make these changes persistent inside the main repo (h2o-3). I know that the "make" command in the h2o-flow project copies the files to h2o-3, but what do I need to do to build h2o-3 again without losing the h2o-flow changes? Thanks!
2 replies
Eunyeong Park

I am trying to use xgboost with sparkling water, but when I use the fit method, I get the following error message. Other models work fine. Is xgboost no longer supported?

My pyspark version is 3.2.1 and sparkling water is

from pysparkling.ml import H2OXGBoostClassifier
estimator = H2OXGBoost(labelCol = "CAPSULE")
model = estimator.fit(trainingDF)
Py4JJavaError                             Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14728/2631432188.py in <module>
      1 from pysparkling.ml import H2OXGBoostClassifier
      2 estimator = H2OXGBoost(labelCol = "CAPSULE")
----> 3 model = estimator.fit(trainingDF)

~\anaconda3\lib\site-packages\pyspark\ml\base.py in fit(self, dataset, params)
    159                 return self.copy(params)._fit(dataset)
    160             else:
--> 161                 return self._fit(dataset)
    162         else:
    163             raise TypeError("Params must be either a param map or a list/tuple of param maps, "

~\anaconda3\lib\site-packages\pyspark\ml\wrapper.py in _fit(self, dataset)
    334     def _fit(self, dataset):
--> 335         java_model = self._fit_java(dataset)
    336         model = self._create_model(java_model)
    337         return self._copyValues(model)

~\anaconda3\lib\site-packages\pyspark\ml\wrapper.py in _fit_java(self, dataset)
    330         """
    331         self._transfer_params_to_java()
--> 332         return self._java_obj.fit(dataset._jdf)
    334     def _fit(self, dataset):

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)

~\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o2851.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node responded with
Status code: 404 : Not Found
Server error: {"__meta":{"schema_version":3,"schema_name":"H2OErrorV3","schema_type":"H2OError"},"timestamp":1647850376319,"error_url":"POST /3/ModelBuilders/xgboost","msg":"\n\nERROR MESSAGE:\n\nPOST /3/ModelBuilders/xgboost not found\n\n","dev_msg":"\n\nERROR MESSAGE:\n\nPOST /3/ModelBuilders/xgboost not found\n\n","http_status":404,"values":{},"exception_type":"water.exceptions.H2ONotFoundArgumentException","exception_msg":"\n\nERROR MESSAGE:\n\nPOST /3/ModelBuilders/xgboost not found\n\n","stacktrace":["water.exceptions.H2ONotFoundArgumentException: POST /3/ModelBuilders/xgboost not found","    water.api.RequestServer.response404(RequestServer.java:743)","    water.api.RequestServer.serve(RequestServer.java:467)","    water.api.RequestServer.doGeneric(RequestServer.java:301)","    water.api.RequestServer.doPost(RequestServer.java:227)","    javax.servlet.http.HttpServlet.service(HttpServlet.java:523)","    javax.servlet.http.HttpServlet.service(HttpServlet.java:590)","    ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)","    ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)","    ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)","    ai.h2o.o
6 replies
Cristiano Sarmento
Where are h2o-3 uploaded files stored by default? Thanks!
I wanted to understand the meaning of this error and/or its root causes. Thanks
using h2o==
    340         if self.model:
--> 341             predictions = self.model.predict(frame)
    342             logger.debug(f"Prediction df columns: {predictions.columns}")
    343             result = frame.cbind(predictions).as_data_frame().drop('predict', axis=1)

/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/model/model_base.py in predict(self, test_data, custom_metric, custom_metric_func)
    233             eval_func_ref = h2o.upload_custom_metric(custom_metric)
    234         if not isinstance(test_data, h2o.H2OFrame): raise ValueError("test_data must be an instance of H2OFrame")
--> 235         j = H2OJob(h2o.api("POST /4/Predictions/models/%s/frames/%s" % (self.model_id, test_data.frame_id), data = {'custom_metric_func': custom_metric_func}),
    236                    self._model_json["algo"] + " prediction")
    237         j.poll()

/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    105     # type checks are performed in H2OConnection class
    106     _check_connection()
--> 107     return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)

/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    476                 save_to = save_to(resp)
    477             self._log_end_transaction(start_time, resp)
--> 478             return self._process_response(resp, save_to)
    480         except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/backend/connection.py in _process_response(response, save_to)
    825         # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
    826         if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 827             raise H2OResponseError(data)
    829         # Server errors (notably 500 = "Server Error")
H2OResponseError: Server error water.exceptions.H2OKeyNotFoundArgumentException:
  Error: Object 'Generic_model_python_1648692368835_1' not found in function: predict for argument: model
  Request: POST /4/Predictions/models/Generic_model_python_1648692368835_1/frames/py_5_sid_8e84
9 replies

How do I force-stop an h2o cluster?

I have a local h2o cluster that is still running after adding the following code to my script

  if h2o.cluster():

before calling h2o.init()

I can see that my script is using an old cluster instead of stopping it and creating a new one

H2O_cluster_uptime:         56 mins 17 secs

How can I stop the old cluster?

1 reply
Erin LeDell
@razou does h2o.cluster() return a True/False?
2 replies
Erin LeDell
something i have noticed, at least in R, is that it takes a few seconds to shut the cluster down, so you might want to add a sleep() in there to give it time to shut down before the new one starts up
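
The "shut down, then give it a moment" advice can be made a bit more robust by polling instead of using a fixed sleep. Below is a sketch: `shutdown_and_wait` is a generic helper (hypothetical name) and `restart_h2o` shows how it might be wired to the Python client. Note that in a fresh Python process `h2o.cluster()` is `None` until `h2o.init()` or `h2o.connect()` has been called, so the sketch attaches first; that detail, and the behaviour of `is_running()` after shutdown, should be checked against your H2O version.

```python
import time


def shutdown_and_wait(shutdown, is_running, timeout=30.0, poll=0.5, sleep=time.sleep):
    """Call shutdown() and poll is_running() until it returns False.

    Returns True if the cluster went down within `timeout` seconds.
    """
    shutdown()
    waited = 0.0
    while is_running():
        if waited >= timeout:
            return False
        sleep(poll)
        waited += poll
    return True


def restart_h2o(**init_kwargs):
    """Stop the local cluster this session attaches to, then start a fresh one (needs h2o)."""
    import h2o  # lazy import so the helper above can be tried without h2o
    h2o.init(**init_kwargs)  # attaches to a running local cluster if there is one
    cluster = h2o.cluster()

    def still_up():
        try:
            return bool(cluster.is_running())
        except Exception:
            return False  # treat connection errors as "cluster is gone"

    shutdown_and_wait(lambda: cluster.shutdown(prompt=False), still_up)
    h2o.init(**init_kwargs)
```

Calling `restart_h2o()` at the top of a script would then replace the bare `h2o.init()`.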
Cristiano Sarmento
Good morning everyone. I have a question: when an AutoML model is created, several auxiliary data files and models are created. Do you know where these files are stored within h2o Flow?
2 replies
Ah, one more question: do you know if these files can be stored in an Amazon S3 bucket?

I want to ask a question about h2o auto ml. The code is below:

H2OAutoML(max_models=2, seed=50, max_runtime_secs=300, nfolds=5, stopping_metric='misclassification', stopping_tolerance=0.01, stopping_rounds=3, verbosity="warn", include_algos = ["GLM", "DeepLearning", "DRF","GBM"])

I wonder how H2OAutoML selects the first model to train among "GLM", "DeepLearning", "DRF" and "GBM". Is there any mechanism? When H2OAutoML finishes training the first model, how does it select the second one among the same algorithms? Is there any mechanism for that?

1 reply

What is the recommended approach to dynamically determining the mapping of cross-validation fold to cross-validation model? I've noticed that the mapping varies by version of h2o.

This is pretty slow, but currently I use h2o.cross_validation_predictions. This has a non-zero prediction for holdout records and zero if it was in the training of that cross-validation model. I can then line that up with my definition of the fold column to create a mapping.

My concern is that it appears this functionality is deprecated and could be removed:

[1] "Cross-validation predictions, one per cv model (deprecated, use cross_validation_holdout_predictions_frame_id instead)"

That recommended frame is the aggregate holdout predictions so I lose the ability to identify which cross-validation model it came from.

Perhaps a simpler solution would be to add the mapping of fold-to-cross-validation model to the resulting model object. Would that be possible?
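
The workaround described above (line up non-zero cross-validation predictions with the fold column) can be written down compactly once the data is on the client side. This is a plain-Python sketch on lists, not an H2O API; the function name `fold_to_cv_model` and the toy values are made up, and it relies on the stated assumption that each CV model's predictions are zero for rows it was trained on.

```python
def fold_to_cv_model(fold_assignment, cv_predictions):
    """Map each fold id to the index of the CV model that held it out.

    fold_assignment: fold id per row.
    cv_predictions: one list per CV model, non-zero only on that model's holdout rows.
    """
    mapping = {}
    for model_index, preds in enumerate(cv_predictions):
        held_out = {fold for fold, p in zip(fold_assignment, preds) if p != 0}
        if len(held_out) != 1:
            raise ValueError(
                "model {} does not isolate a single fold: {}".format(
                    model_index, sorted(held_out)))
        mapping[held_out.pop()] = model_index
    return mapping


folds = [0, 1, 2, 0, 1, 2]
preds = [
    [0.0, 0.0, 0.7, 0.0, 0.0, 0.4],  # model 0 held out fold 2
    [0.3, 0.0, 0.0, 0.9, 0.0, 0.0],  # model 1 held out fold 0
    [0.0, 0.5, 0.0, 0.0, 0.2, 0.0],  # model 2 held out fold 1
]
print(fold_to_cv_model(folds, preds))
```

Only a sample of rows per fold is needed to build the mapping, which may help with the slowness mentioned above.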

Michal Kurka

Perhaps a simpler solution would be to add the mapping of fold-to-cross-validation model to the resulting model object. Would that be possible?

@hutch3232 good question and great suggestion, there should be more explicit mapping

I like it - I had the same thought at one point but didn't get to it yet - can you please file a jira ticket so that the idea doesn't get lost? https://h2oai.atlassian.net/

1 reply
Golam Rashed
Hello, Any simple tutorial available for AutoML?
1 reply

Hello, I'm testing out h2o.gam and am coming across something unintuitive. Does it make sense for identical training models (same deviance) to have different predictions (from the training model) when comparing with and without cross-validation?

train <- h2o.createFrame(cols = 5, seed = 22, seed_for_column_types = 55, factors = 3, missing_fraction = 0)
train$fold <- h2o.kfold_column(train, nfolds = 3, seed = 11)
train$response <- 50 + 
  ifelse(train$C5 == "c4.l0", 10,
         ifelse(train$C5 == "c4.l1", 15,
                ifelse(train$C5 == "c4.12", 20, 25))) +
  0.2 * train$C1 +
  - 0.05 * train$C2 +
  -0.2 * train$C4 - 0.005 * train$C4^2 + 0.00005 * train$C4^3

params <- list(
  x = c("C1", "C2", "C5"),
  y = "response", 
  training_frame = train,
  lambda = 0,
  keep_gam_cols = TRUE,
  gam_columns = c("C4"),
  scale = c(.05),
  num_knots = c(5),
  spline_orders = c(3)
)

# no cross validation, bs = 0 (default)
mod <- do.call(what = "h2o.gam", args = params)

h2o.residual_deviance(object = mod, train = TRUE)
# [1] 76080.37

h2o.predict(mod, train)
# predict
# 1 43.49629
# 2 61.16891
# 3 58.14821
# 4 49.06894
# 5 54.63423
# 6 33.10237
# [10000 rows x 1 column] 

# cross validation, bs = 0 (default)
mod2 <- do.call(what = "h2o.gam", args = c(params, fold_column = "fold"))

h2o.residual_deviance(object = mod2, train = TRUE)
# [1] 76080.37

h2o.predict(mod2, train)
# predict
# 1 115.53379
# 2  71.36690
# 3  44.06435
# 4  90.39080
# 5  66.77768
# 6 104.90591
# [10000 rows x 1 column]

It might also be worth looking at the models' h2o.residual_analysis_plot. mod looks pretty normal, but mod2 shows very strange patterns in the residuals. Not providing here, but different values of bs also had inconsistencies and strange residuals.

Thanks for any help understanding what is happening!


Hi Paul:

I tried your code and was able to reproduce the problem. When you use fold column in cross-validation or specify a weight column in your parameter, the practice is to remove the fold/weight column in your data frame before calling prediction on it.

I removed the fold column before predicting with mod2, but the prediction outputs are still not the same as those from mod. We have a bug in our code and I have opened a JIRA: https://h2oai.atlassian.net/browse/PUBDEV-8681.

The bs parameter specifies the type of splines GAM uses, so different values of bs will build different models.

Thank you,

1 reply
Simon Schmid
Hi all, is there an estimate when version will be published? This version fixes a bug I am waiting for.