I wanted to perform frequency encoding on the categorical features of a given frame.
Suppose I have a dictionary whose keys are the categorical columns and whose values are their frequencies. How can I replace each category with its frequency across the entire frame?
Example with a pandas data frame:

fe = h2oDF_fe.groupby(col).size() / len(h2oDF_fe)
h2oDF_fe.loc[:, col_fe] = h2oDF_fe[col].map(fe)
Is there an equivalent for H2O frames of the pandas map used above?
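For reference, the transformation itself is simple to state in plain Python. This sketch (independent of pandas and H2O; the sample values are made up) shows exactly what the pandas snippet above computes:

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)                               # category -> count
    n = len(values)
    freqs = {cat: cnt / n for cat, cnt in counts.items()}  # category -> frequency
    return [freqs[v] for v in values]

# "a" appears 3 times out of 4, "b" once out of 4
encoded = frequency_encode(["a", "b", "a", "a"])
# -> [0.75, 0.25, 0.75, 0.75]
```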
xgboost(..., backend="auto") is also crashing.
xgboost(..., backend="cpu") works.
Given a dictionary (imputer_dict) whose keys are the same as my frame's columns, I wanted to fill the NAs in the frame with that dictionary's values, as is possible with pandas DataFrames (e.g. DataFrame.fillna(value=imputer_dict)).
For each column (x) in my frame I wanted to replace its NAs with that column's value in imputer_dict (i.e. imputer_dict[x]). Does anybody have an idea?
I was trying to impute a frame with a dictionary (each feature with its imputation value):

frame.impute(values=[imputer_dict[col] for col in frame.columns])
I wanted to skip some columns (e.g. x) by adding None as their value in imputer_dict, as described in the documentation, but I'm getting an error.
The documentation is confusing to me. Does anybody have an idea of what caused the error?
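To be explicit about the behaviour I'm after: logically it amounts to the plain-Python sketch below, where a None in imputer_dict means "leave this column alone" (this is not the H2O API, just the intended semantics; column names are made up):

```python
def impute_with_dict(rows, imputer_dict):
    """rows: list of dicts mapping column -> value, with None meaning NA.
    Columns mapped to None in imputer_dict are left untouched."""
    out = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            fill = imputer_dict.get(col)
            if val is None and fill is not None:
                new_row[col] = fill      # replace the NA with the dict value
            else:
                new_row[col] = val       # keep real values; skip None-columns
        out.append(new_row)
    return out

filled = impute_with_dict([{"a": None, "b": 2, "x": None}],
                          {"a": 0, "b": -1, "x": None})
# -> [{"a": 0, "b": 2, "x": None}]   (column "x" was skipped)
```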
I have a question.
A quick background on what I did with H2O is that I am able to perform predictions in my own Java environment using the MOJO that I downloaded from H2O.
I am planning to use the outputs (Scoring History, Variable Importances, etc.) from the model that I built in H2O.
I want to use this data and display them in my own development environment.
I have checked H2O's documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html).
However, I can't seem to find any sample that shows this.
I want to ask: is it possible to retrieve and display this data using the MOJO?
If so, how can I do this? Is there a documentation reference I can look into for this?
I have a question about the documentation for the stopping_rounds parameter for h2o’s models: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/stopping_rounds.html
"To disable this feature, specify 0. When disabled, the metric is computed on the validation data (if provided); otherwise, training data is used."
I was wondering if this is a typo: was it meant to say "when enabled" rather than "when disabled"? Or does it mean that, when early stopping is enabled, the metric will be computed on the training data rather than the validation data?
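For concreteness, my mental model of stopping_rounds is the usual moving-window rule sketched below (a simplified plain-Python sketch: lower metric = better, no stopping_tolerance; H2O's actual rule may differ in details). The question above is only about which dataset the metric series comes from:

```python
def should_stop(metric_history, stopping_rounds):
    """Stop when the best metric over the last `stopping_rounds` scoring
    events is no better than the best over the `stopping_rounds` before
    them. Simplified: lower is better, no tolerance."""
    if stopping_rounds == 0 or len(metric_history) < 2 * stopping_rounds:
        return False  # feature disabled, or not enough history yet
    recent = min(metric_history[-stopping_rounds:])
    earlier = min(metric_history[-2 * stopping_rounds:-stopping_rounds])
    return recent >= earlier  # no improvement in the recent window

# still improving -> keep going; plateaued/worsening -> stop
# should_stop([1.0, 0.9, 0.8, 0.8], 2) -> False
# should_stop([0.5, 0.5, 0.6, 0.7], 2) -> True
```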
Hi. I am running H2O with Python.
So far, what I am able to do is build a GBM model and print its data. Below is my sample code.
gbm_model = H2OGradientBoostingEstimator(ntrees=100, max_depth=4, learn_rate=0.1)
gbm_model.train(predictors, response, training_frame=trainingFrame)
Printing gbm_model displays tables of data like Scoring History and Variable Importances.
What I want to achieve is to retrieve each piece of data (together with its header name) so that I can map and display it in my own way.
So, I tried to access the Variable Importances data by looping through it.
print("Loop through Variable Importance Items")
varImp = gbm_model.varimp()
for varImpItem in varImp:
    for item in varImpItem:
        print(item)
For additional info, gbm_model.varimp() returns a ModelBase object.
However, what was retrieved was only the data itself.
The header names (variable, relative_importance, scaled_importance, percentage) were not included for the display.
I want to ask: is there a way to retrieve the header names for this? If so, how can I do it?
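Since the loop above shows each row coming back as bare values, one workaround is to pair the rows with the header names from the printed table yourself. This is a plain-Python sketch: the header list is copied from the printed table above, and the sample row is made up, not a documented H2O API:

```python
# header names as they appear in the printed Variable Importances table
headers = ["variable", "relative_importance", "scaled_importance", "percentage"]

def label_varimp_rows(rows):
    """Turn each bare row of values into a dict keyed by the header names."""
    return [dict(zip(headers, row)) for row in rows]

# e.g. with one fake row standing in for gbm_model.varimp():
labeled = label_varimp_rows([("x1", 10.0, 1.0, 0.6)])
# -> [{"variable": "x1", "relative_importance": 10.0,
#      "scaled_importance": 1.0, "percentage": 0.6}]
```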
I was comparing h2o.xgboost vs the native xgboost by following the instructions written in estimator_base.py:
h2o.init()
training_hf = h2o.import_file("train.csv")
h2o_booster = H2OXGBoostEstimator(distribution="bernoulli", seed=0, ntrees=10,
                                  max_depth=5, min_split_improvement=0.1,
                                  learn_rate=0.1, sample_rate=0.9,
                                  col_sample_rate_per_tree=0.9, min_rows=2)
label = "response"
features = training_hf.columns
features.remove(label)
training_hf[label] = training_hf[label].asfactor()
h2o_booster.train(x=features, y=label, training_frame=training_hf)
h2oPredict = h2o_booster.predict(training_hf).as_data_frame()['p1'].values
nativeDMatrix = training_hf.convert_H2OFrame_2_DMatrix(features, label, h2o_booster)
nativeDMatrix.feature_names = features
nativeParams = h2o_booster.convert_H2OXGBoostParams_2_XGBoostParams()  # (params, num_boost_round)
nativeModel = xgb.train(params=nativeParams[0], dtrain=nativeDMatrix,
                        num_boost_round=nativeParams[1])
nativePredict = nativeModel.predict(data=nativeDMatrix, ntree_limit=nativeParams[1])
Apparently the predictions are very close (not merely rounding differences), but definitely not exactly the same. Did I miss anything from the instructions?
Moreover, the tree structures start to diverge considerably after the first few, which are very similar.
Has anyone managed to get XGBoost in h2o-3 to use a GPU backend when running in a docker container? And, if so, can you give me some pointers?
I'm running out of ideas and the only output I get from h2o to debug is:
ERRR on field: _backend: GPU backend (gpu_id: 0) is not functional. Check CUDA_PATH and/or GPU installation.
Is there any way to get something more verbose?
For context: I'm using Metaflow in combination with AWS Batch.
I have tested whether the container can see the GPU using pynvml and it seems to work. I also ran a test script using tensorflow-gpu and that seemed to work too. That leads me to conclude the problem is with h2o and/or CUDA.
The documentation on configuring h2o to use GPUs with XGBoost is pretty limited in scope and as far as I can tell, I'm meeting the requirements.
Any help/advice much appreciated...
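Since the error message points specifically at CUDA_PATH, one quick thing to check inside the container is whether the environment and driver tooling that h2o complains about are actually visible. A small Python sketch (the checks are my guesses at what's relevant, not h2o's actual probe):

```python
import os
import shutil

def gpu_env_report(environ=None):
    """Collect the basic facts h2o's GPU error message complains about."""
    env = os.environ if environ is None else environ
    return {
        # the variable named in "Check CUDA_PATH"
        "cuda_path": env.get("CUDA_PATH"),
        # is the NVIDIA driver tooling on PATH inside the container?
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }

if __name__ == "__main__":
    print(gpu_env_report())
```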
Hi all! Found a bug in error handling:
raise ValueError("'test_data' must be of type H2OFrame. Got: " + type(test_data))
In Python 3.7.4, type(test_data) is a type object, not a string, so the concatenation itself raises a TypeError.
It should probably be replaced with
raise ValueError("'test_data' must be of type H2OFrame. Got: " + type(test_data).__name__)
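A minimal reproduction of why the original line fails and why .__name__ fixes it (using a plain list to stand in for the non-H2OFrame argument):

```python
x = [1, 2, 3]  # anything that is not an H2OFrame

# The original concatenation raises TypeError ("can only concatenate str
# (not \"type\") to str"), masking the intended ValueError.
try:
    msg = "'test_data' must be of type H2OFrame. Got: " + type(x)
except TypeError:
    msg = None

# Using the type's name turns it into a plain string concatenation.
fixed = "'test_data' must be of type H2OFrame. Got: " + type(x).__name__
# -> "'test_data' must be of type H2OFrame. Got: list"
```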
Would anyone be able to give me some pointers on optimising resource allocation for XGBoost training in h2o? I'm sure I read in the documentation that you need to leave a proportion of the available CPU/memory free for XGBoost. Is this correct, or should I be giving h2o as much as possible?
For example, if I have an instance with 10 CPUs and 100 GB of RAM, what should I allocate directly to h2o, and what, if anything, should I keep free for XGBoost?
h2o.explain() features in h2o Python.