Is there any possibility to implement Rotation Forest in H2O? Given H2O's great Random Forest and Principal Component Analysis algorithms, it seems it would be fairly easy to build a Rotation Forest. A paper on arXiv (https://arxiv.org/abs/1809.06705) shows that Rotation Forest is the best classifier for problems with continuous features.
1 reply
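For readers curious about the idea: Rotation Forest trains each tree on data rotated by PCA applied to disjoint random feature subsets. A minimal NumPy sketch of the rotation step (illustrative only, not H2O code; all names here are made up):

```python
import numpy as np

def rotation_matrix(X, n_subsets=2, seed=1):
    """Rotation Forest's per-tree rotation: PCA on disjoint random feature subsets."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = np.zeros((d, d))
    for subset in np.array_split(rng.permutation(d), n_subsets):
        # centre the subset's columns, then take principal axes via SVD
        Xs = X[:, subset] - X[:, subset].mean(axis=0)
        _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
        R[np.ix_(subset, subset)] = Vt.T
    return R  # each tree in the ensemble is then fit on X @ R with its own R
```

Each base tree gets its own bootstrap sample and its own rotation matrix, which is what makes the ensemble diverse.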

With a dictionary (imputer_dict) whose keys are the same as my frame's columns, I wanted to fill NAs in a frame with that dictionary's values, as is possible with pandas dataframes (e.g. DataFrame.fillna(value=imputer_dict)).

For each column (x) in my frame I wanted to replace its NAs with its value in imputer_dict (i.e. imputer_dict[x]).

Does anybody have an idea?


12 replies
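For reference, here is the pandas behaviour being described, next to the explicit per-column loop it is equivalent to; the loop is the pattern one would port to an H2OFrame (this is plain pandas, not H2O code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, "x", "y"]})
imputer_dict = {"a": 0.0, "b": "missing"}

# one-shot dict form: each key names a column, each value fills that column's NAs
filled = df.fillna(value=imputer_dict)

# equivalent explicit per-column loop (the shape of a per-column workaround)
looped = df.copy()
for col, val in imputer_dict.items():
    looped.loc[looped[col].isna(), col] = val
```

Both paths produce the same frame, which is why the dict form is just a convenience.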

I was trying to impute a frame with a dictionary (each feature with its imputation value):
frame.impute(values=[imputer_dict[col] for col in frame.columns])
I wanted to skip some columns (e.g. x) by adding None as the value in imputer_dict, as said in the documentation, and I'm getting the following error.

The documentation is confusing to me. Does anybody have an idea of what caused the error?

2 replies
Gabriel Fields


I have a question.
A quick background on what I did with H2O is that I am able to perform predictions in my own Java environment using the MOJO that I downloaded from H2O.

I am planning to use the outputs (Scoring History, Variable Importances, etc) from the model that I built in H2O.
I want to use this data and display them in my own development environment.
I have checked H2O's documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html).
However, I can't seem to find any sample that shows this.

I want to ask: is it possible to retrieve and display this data using the MOJO?
If so, how can I do this? Is there a documentation reference I can look into for this?



I have a question about the documentation for the stopping_rounds parameter for h2o’s models: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/stopping_rounds.html

It says: To disable this feature, specify 0. When disabled, the metric is computed on the validation data (if provided); otherwise, training data is used.

I was wondering if this was a typo; was it meant to say when enabled rather than when disabled? Or does this mean that if enabled, early stopping will be computed using the training data rather than the validation data?


1 reply
Oscar Pan
Hi, I was playing with h2o GLM (binomial) and these are the predictions. Are they wrong? Why is p1 so small but the predicted labels are all "1"?
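On the small-p1 question: H2O's binomial models assign the label by comparing p1 to a threshold chosen on training metrics (max F1 by default), not to 0.5, so a tiny threshold can label almost everything "1". A toy illustration of that logic (not H2O code):

```python
def predict_label(p1, threshold):
    # H2O compares p1 to a metric-optimal threshold (max F1 by default),
    # not to 0.5, so small p1 values can still map to label 1
    return 1 if p1 > threshold else 0

# with a tiny threshold, even small probabilities get label 1
labels = [predict_label(p, 0.03) for p in (0.04, 0.09, 0.31)]
```

The thresholds H2O actually used are visible in the model's metrics output.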
Gabriel Fields

Hi. I am running H2O with Python.

So far, what I am able to do is build a GBM model and print its data. Below is my sample code.

gbm_model = H2OGradientBoostingEstimator(ntrees=100, max_depth=4, learn_rate=0.1)
gbm_model.train(predictors, response, training_frame=trainingFrame)

Printing gbm_model displays tables of data like Scoring History and Variable Importances.
What I want to achieve is to retrieve each item of data (with its header name) so that I can map and display the data in my own way.
So, I tried to access the Variable Importances data by looping through it.

print("Loop through Variable Importance Items")
varImp = gbm_model.varimp()

for varImpItem in varImp:
    for item in varImpItem:
        print(item)

For additional info, gbm_model.varimp() returns a ModelBase object.

However, what was retrieved was only the data itself.
The header names (variable, relative_importance, scaled_importance, percentage) were not included for the display.

I want to ask, is there a way to retrieve the header names for this? If so, how can I do it?

Hi @gabfields02
You can retrieve each piece of data from gbm_model._model_json['output']
  • For variable importance, if you want it in a dataframe:
var_imp = gbm_model._model_json['output']['variable_importances'].as_data_frame()
11 replies
I am using EasyPredictModelWrapper to run trained models.
For numerical and categorical input values this works fine.
But I noticed that there are problems with time columns: Time is not converted automatically, as in the flow interface, but an error is thrown.
Do Time Columns need to be addressed in a special way?
Oscar Pan

I was comparing h2o.xgboost vs the native xgboost by following the instructions written in estimator_base.py:

        training_hf = h2o.import_file("train.csv")
        label = "response"
        features = training_hf.columns
        training_hf[label] = training_hf[label].asfactor()

        h2o_booster = H2OXGBoostEstimator(distribution="bernoulli")
        h2o_booster.train(x=features, y=label, training_frame=training_hf)
        h2oPredict = h2o_booster.predict(training_hf).as_data_frame()['p1'].values

        nativeDMatrix = training_hf.convert_H2OFrame_2_DMatrix(features, label, h2o_booster)
        nativeDMatrix.feature_names = features
        nativeParams = h2o_booster.convert_H2OXGBoostParams_2_XGBoostParams()
        nativeModel = xgb.train(params=nativeParams[0], dtrain=nativeDMatrix, num_boost_round=nativeParams[1])
        nativePredict = nativeModel.predict(data=nativeDMatrix, ntree_limit=nativeParams[1])

Apparently the predictions are just very close (not merely rounding differences), but definitely not exactly the same. Did I miss anything from the instructions?

Moreover, the tree structure starts to diverge a lot after a few initial ones being very similar.

2 replies
I am trying to run: localH2O = h2o.init(ip="localhost", port = 54321, startH2O = TRUE, nthreads=-1) but I am getting this error: Error in h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, nthreads = -1) :
H2O failed to start, stopping execution.
Erin LeDell
@surajenv_twitter there's not enough info. Please search on Stack Overflow first... I think this question has been solved a few times already.
Hi everyone, I'm new here. Actually, I want to know whether there are any cases of H2O being used for fault diagnosis of construction machines like excavators. Thanks a lot for your answers! :)
Gabriel Fields
Is there a way to run one H2O instance and access that same instance from different servers?
If so, how can I do it? Would it also be possible when running H2O with Python?
Owen Ball

Hi All,

Has anyone managed to get XGBoost in h2o-3 to use a GPU backend when running in a docker container? And, if so, can you give me some pointers?

I'm running out of ideas and the only output I get from h2o to debug is:

ERRR on field: _backend: GPU backend (gpu_id: 0) is not functional. Check CUDA_PATH and/or GPU installation.

Is there any way to get something more verbose?

For context. I'm using Metaflow in combination with AWS Batch.

  • The AWS batch AMI is a slightly customised build on top of the ECS GPU Optimised AMI which includes nvidia driver 418.87.00
  • The container image is built on nvidia/cuda runtime centos 7. I've tried every version of CUDA, but currently have cuda 8 and then install the cuda 9 libraries
  • /usr/local/cuda symlink points to cuda 9
  • h2o version is
  • Packages and Python 3 version are all managed with conda environment
  • Instance type is p3.2xlarge with nvidia Tesla

I have tested whether the container can see the GPU using pynvml and it seems to work. I also ran a test script using tensorflow-gpu and that seemed to work too. That leads me to conclude the problem is with h2o and/or cuda.

The documentation on configuring h2o to use GPUs with XGBoost is pretty limited in scope and as far as I can tell, I'm meeting the requirements.

Any help/advice much appreciated...


9 replies
I'm not sure where the H2O suggestion box is, but would it be possible to impose monotonicity on gam_columns?
Hey, is there a possibility to get the same date-format parsing from the frame-parsing process (water.parser.ParseTime) with the EasyPredictModelWrapper as well? Right now it's a bit tedious to use time types with a MOJO, as it expects a unix timestamp and not the date format that was originally uploaded. Therefore the training data, or rather data in the same original format, can't be used directly for predictions...
Gabriel Fields
Hello, does anyone know how to start two instances of H2O in Python? I have only found documentation in R. I was wondering if there is a documentation in Python for this.
22 replies
Seiji Kumagai
Hi, I'd like to use context_path as an argument in h2o.init() in python, so I made a pull request #4911. Can somebody review it and let me know the next step I need to take?
Tom Roderick

Hi all! Found a bug in error handling:

  • h2o\model\model_base.py
  • Line 378:
    raise ValueError("'test_data' must be of type H2OFrame. Got: " + type(test_data))

type(test_data) returns a class object, not a string, so in Python 3.7.4 the concatenation itself raises a TypeError. It should probably be replaced with
raise ValueError("'test_data' must be of type H2OFrame. Got: " + type(test_data).__name__)

1 reply
I'm not sure what the OS contribution process looks like; I couldn't see any obvious path on GitHub.
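A quick self-contained demonstration of the bug and the fix (the H2OFrame class here is just a hypothetical stand-in):

```python
class H2OFrame:  # hypothetical stand-in for the real h2o class
    pass

def check_frame(test_data):
    if not isinstance(test_data, H2OFrame):
        # type(test_data) is a class object; concatenating it to a str would
        # raise TypeError and mask the intended ValueError. .__name__ is a str.
        raise ValueError("'test_data' must be of type H2OFrame. Got: "
                         + type(test_data).__name__)

try:
    check_frame(42)
except ValueError as e:
    message = str(e)
```

With the `.__name__` fix, the caller sees the intended, readable error message instead of a TypeError.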
Sandeep Kunsoth
Hi all, I am getting this error when running on a VM: raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data)). h2o.exceptions.H2OServerError: HTTP 400 Bad Request:
h2o version:
It worked some days before; the same code is not working now. Please help. Currently it only works locally.
7 replies
When I am training with XGBoost in H2O (non-GPU version) and I list all frames, I find that the train and validation frames within each CV fold contain exactly the same number of rows. Is this a mistake in the CV splitting?
7 replies
Owen Ball

Would anyone be able to give me some pointers on optimising resource allocation for XGBoost training in h2o? I'm sure I read in the documentation that you needed to leave a proportion of the available cpu/memory for XGBoost? Is this correct, or should I be giving H2O as much as possible?

For example, if I have an instance with 10 CPUs and 100 GB RAM, what should I allocate directly to h2o and what, if any, should I keep free for XGBoost?

2 replies
Gabriel Fields
Hi. I want to ask, how can I get a list of all the model IDs in my cluster in Python?
I can successfully retrieve a model using h2o.get_model(model_id).
However, I am manually inputting or hard coding the model_id.
Other than H2O Flow (Models >> List All Models), I want to know if there is a way to list all the model IDs in the cluster. Thanks.
2 replies
Hi all, I was wondering if anyone has had success using the h2o.explain() features in h2o Python.
Basically the feature is non-existent in - which is the latest version for Python?
I created an issue here https://h2oai.atlassian.net/jira/software/c/projects/PUBDEV/issues/PUBDEV-7850?jql=project%20%3D%20%22PUBDEV%22%20ORDER%20BY%20created%20DESC
1 reply
Gabriel Fields
Hi. I just want to clarify. Is importing a MOJO for scoring the same as using an imported binary model with h2o.load_model(yourModel)?
1 reply
Chen Kepeng
Hi all, I want to know how I can do i18n on the H2O Flow web UI? Any clues are appreciated.
Hello, I wanted to know how to perform stratified sampling in h2o?
1 reply
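H2O exposes a stratified_split helper on columns (it appears later in this thread); conceptually it does something like this plain-Python sketch, sampling the same fraction of rows from every class (illustrative, not H2O code):

```python
import random
from collections import defaultdict

def stratified_test_indices(labels, test_frac=0.2, seed=1):
    """Pick roughly test_frac of the rows from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    test = set()
    for rows in by_class.values():
        rng.shuffle(rows)                      # sample within each class
        test.update(rows[: max(1, round(len(rows) * test_frac))])
    return test
```

Because each class is sampled separately, rare classes keep their proportion in both splits, unlike a plain random split.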

When importing from JDBC, what should the connection string look like?



2 replies
Hey, I wanted to ask if there is any chance this feature request may be implemented https://h2oai.atlassian.net/browse/PUBDEV-7700?
(K/V:13.2 MB + POJO:17.1 MB + FREE:464.6 MB == MEM_MAX:494.9 MB), desiredKV=2.15 GB OOM
What does the FREE value mean? And why am I OOM if I've got 10x as much FREE as K/V and POJO?
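A hedged reading of that log line, just from its own arithmetic: FREE appears to be the unallocated remainder of the heap (MEM_MAX minus K/V minus POJO), and the OOM fires because the desired K/V budget of 2.15 GB exceeds not only FREE but the entire 494.9 MB heap, so the likely fix is a larger -Xmx rather than freeing K/V space. The numbers check out:

```python
# figures quoted from the log line above (MB)
kv, pojo, free, mem_max = 13.2, 17.1, 464.6, 494.9
desired_kv = 2.15 * 1024  # 2.15 GB expressed in MB

# FREE is simply what is left of the heap after K/V and POJO
assert abs((kv + pojo + free) - mem_max) < 0.1

# the request exceeds the whole heap, hence OOM regardless of how big FREE is
assert desired_kv > mem_max
```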
I wanted to know how to perform resampling (with or without replacement) in H2O. Is there a native function for that?
The purpose is to down-sample or over-sample the target feature's classes in imbalanced data for classification.
Thank you
Igor Trpovski
Hi everyone,
When I run grid search with parallelism set to 0 or some value different than 1 it always hangs. In other words, after some time the progress bar can reach 100% but the grid never finishes. It happens with both the Cartesian and RandomDiscrete strategies. The model that I'm using is GBM and the cross validation folds are specified through fold_column.
When I run the grid with the default value for parallelism, some parameter combinations fail due to the dataset being very small (<100 rows) but the grid finishes.
Did anyone have a similar problem? I don't know how to debug this issue.
3 replies

I'm training a GBM multi-class classifier and I wanted to know what could cause the following error. Thanks.

raw_df = h2o.import_file()
df = h2o.deep_copy(raw_df[raw_df['x'] > 10, :], 'df')

df['split'] = df['y'].stratified_split(test_frac=0.2, seed=1)
train_valid = df[df['split'] == 'train', :].drop('split')
test = df[df['split'] == 'test', :].drop('split')

train_valid['col_split'] = train_valid['y'].stratified_split(test_frac=0.2, seed=1)

train = train_valid[train_valid['col_split'] == 'train', :].drop('col_split')
valid = train_valid[train_valid['col_split'] == 'test', :].drop('col_split')

raw_df['y'].unique().nrow => 95
df['y'].unique().nrow => 93
train['y'].unique().nrow => 93

training the GBM algo with class_sampling_factors = [w1, ...., w93]

OSError: Job with key $03017f00000132d4ffffffff$_af9c11386cb765249816853dfc3d47fe failed with an exception: java.lang.IllegalArgumentException: class_sampling_factors must have 95 elements
java.lang.IllegalArgumentException: class_sampling_factors must have 95 elements
    at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:244)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:238)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1563)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

or the following one, when using "balance_classes": True in GBM model

OSError: Job with key $03017f00000132d4ffffffff$_acb90549c4fb00eefd9be1d55ab5448b failed with an exception: java.lang.IllegalArgumentException: Error during sampling - too few points?
java.lang.IllegalArgumentException: Error during sampling - too few points?
    at water.util.MRUtils.sampleFrameStratified(MRUtils.java:309)
    at hex.tree.SharedTree$Driver.computeImpl(SharedTree.java:252)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:238)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1563)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
2 replies
Hi all. H2O says in the documentation that splitting on a feature for regression GBMs is based on the reduction in squared error. Is this squared error based on the node residuals, i.e. (resid - mean resid)^2, or on the true response, i.e. (response - mean response)^2? I'm using gamma/Poisson distributions.
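For intuition on what "reduction in squared error" means for a split: gain = SSE(parent) - SSE(left) - SSE(right), computed on whatever values the tree is being fit to. In gradient boosting each tree fits the pseudo-residuals of the current iteration, so the working values are residual-like rather than the raw response (this is the generic GBM formulation, not a reading of H2O's source):

```python
import numpy as np

def split_gain(values, left_mask):
    """Squared-error reduction of a split. `values` are whatever the tree fits;
    in GBM that is the pseudo-residuals of the current boosting iteration."""
    def sse(v):
        return float(((v - v.mean()) ** 2).sum()) if v.size else 0.0
    return sse(values) - sse(values[left_mask]) - sse(values[~left_mask])
```

A split that perfectly separates two constant groups recovers the parent's entire SSE as gain, which is the best a single split can do.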
Does calibration (with the h2o lib) work only for binary classification?