Hi all. I used H2O's Isolation Forest algorithm implementation in Python 3 in an AWS cluster environment (not sure which of these details is relevant). FYI, I am a data scientist, not a software engineer, so I am not proficient in Java, which I see a lot of the code is in.
My question is: is there a way to extract/save/see the attributes and split values selected for each of the trees that are trained for the isolation forest? I have scoured the documentation and looked at the code on GitHub without seeing any obvious way to do so. My use case is demonstrating to a non-technical audience how these trees are built, since they are skeptical of the "black box" and of the lack of visibility into which attributes/split values the observations are being isolated by.
Thanks.
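In case it helps, the h2o.tree.H2OTree API can pull individual trees out of h2o's tree-based models, and I believe it works for Isolation Forest as well. A sketch (if_model stands in for your trained estimator):

from h2o.tree import H2OTree

# Fetch the first tree of the trained isolation forest
tree = H2OTree(model=if_model, tree_number=0)

print(tree.features)      # which attribute each node splits on
print(tree.thresholds)    # the split value chosen at each node
print(tree.descriptions)  # human-readable description of every node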
h2o_frame = h2o.import_file(CSV_PATH, pattern='{0}_[0-9]+.csv$'.format('train'))
Server error water.exceptions.H2OIllegalArgumentException:
Error: File type mismatch. Cannot parse files [train_115092601.csv] and [train_202032.csv] of type CSV and CSV as one dataset.
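Not sure of the root cause, but one workaround I'd try (a sketch, assuming all the files share the same columns; adjust the path/pattern to your setup) is to import the files individually and then row-bind them, so one file's type guess can't fail the whole multi-file parse:

import glob
import h2o

# Import each training file on its own; col_types can be passed to
# h2o.import_file to pin down any columns the parser disagrees about.
paths = glob.glob('train_*.csv')
frames = [h2o.import_file(p) for p in paths]

# Row-bind the individual frames back into one dataset
h2o_frame = frames[0]
for fr in frames[1:]:
    h2o_frame = h2o_frame.rbind(fr)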
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(name)s %(levelname)s:%(message)s')
logger = logging.getLogger(__name__)
logger.info('Training size: %s', train.nrow)
logger.info('Validation size: %s', validation.nrow)
Hello
I have a frame with some categorical columns (X1, ..., X5), and X3 had only NaN values in the initial csv file. I'm seeing strange behavior with the types attribute:
h2o_df['X3'] = h2o_df['X3'].ascharacter().asfactor()
categorical_cols = [k for (k, v) in h2o_df.types.items() if v == 'enum' and k not in ['y']]
This returns [X1, X2, X3, X4, X5, C1] instead of [X1, X2, X3, X4, X5]. Why was the column C1 added?
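Hard to say without the data, but C1 is h2o's default generated column name, so my guess (an assumption, not verified) is that the assignment appended a new column instead of replacing X3. A quick check:

# Inspect the frame after the assignment: does it now have an extra 'C1'
# column, and what type did X3 actually end up with?
print(h2o_df.columns)
print(h2o_df.types)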
Looking for feedback from the h2o community on how they productionize h2o models.
Based on the documentation, it looks like a Java application is the standard for productionizing h2o models:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
There is also the h2o.upload_mojo() function to load an h2o MOJO model and use it for prediction, or h2o.mojo_predict_csv / h2o.mojo_predict_pandas to get predictions from an h2o model without even starting an h2o cluster. Is this not a good standard?
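For reference, a minimal sketch of the no-cluster scoring path via h2o.mojo_predict_pandas (both paths below are placeholders for wherever your artifacts live; it shells out to the h2o-genmodel jar under the hood):

import pandas as pd
import h2o

# Score a pandas frame directly against a MOJO, no running cluster needed
df = pd.read_csv("to_score.csv")
preds = h2o.mojo_predict_pandas(dataframe=df,
                                mojo_zip_path="model.zip",
                                genmodel_jar_path="h2o-genmodel.jar")
print(preds.head())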
The first call processed will then take 12:00:00 as startTime, and the second call will first wait and then retrieve new start times until it is at least 12:00:01. 12:00:01 will then be saved as lastStartTime, which means that the third call is actually fine with keeping 12:00:00. This then results in an error, as it produces duplicate model ids together with the run of the first call. Not sure what the best fix is; probably just checking that startTime is a time after lastStartTime.
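For illustration only (a generic sketch, not H2O's actual code), the suggested fix amounts to handing out strictly increasing start times under a lock:

import threading
import time

_lock = threading.Lock()
_last_start_time = 0

def next_start_time():
    # Hand out strictly increasing second-resolution timestamps so that
    # two model ids derived from them can never collide.
    global _last_start_time
    with _lock:
        t = int(time.time())
        if t <= _last_start_time:
            t = _last_start_time + 1
        _last_start_time = t
        return t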
sys.ai.h2o.heartbeat.benchmark.enabled=false. However, I am wondering what the downside of disabling it is.
Hello. I'm having an annoyingly time-consuming issue and was wondering if anyone here has any suggestions.
I'm basically using h2o to train and cross-validate some ANNs on a few different bio-logging data sets (some immersion and some acceleration). It works fine for the immersion datasets (which are all binary and <120MB), completing in just a few hours, but for some reason it hangs at 100% while training on the first acceleration one (those are float and considerably larger: 170MB-8GB).
I suspect it's a memory issue, but with no error message I have no idea how to troubleshoot this and proceed. I've been stuck on this for a week now, coming up to the climax of a master's project! Does anyone have any ideas?
The part where it hangs:
deeplearning Model Build progress: <progress bar> 100%
CODE:
#!/usr/bin/env python3
import h2o
from h2o.estimators import H2ODeepLearningEstimator
import glob
import re
h2o.init(min_mem_size='30G', max_mem_size="100G")
files = glob.glob('../Data/Reduced/ACC*.csv')
for f in files:
    # Load data
    data = h2o.import_file(f, header=1)
    data['Dive'] = data['Dive'].asfactor()
    data['BirdID'] = data['BirdID'].asfactor()

    # Extract model ID from filepath
    wdw = re.search(r"/ACC(\d+)_reduced", f).group(1)

    # Build, train, and cross-validate model
    dl_cross = H2ODeepLearningEstimator(model_id='ACC_window_' + wdw,
                                        distribution="bernoulli",
                                        hidden=[200, 200],
                                        fold_column='BirdID',
                                        keep_cross_validation_models=True,
                                        keep_cross_validation_fold_assignment=True,
                                        keep_cross_validation_predictions=True,
                                        score_each_iteration=True,
                                        epochs=50,
                                        train_samples_per_iteration=-1,
                                        activation="RectifierWithDropout",
                                        #input_dropout_ratio=0.2,
                                        hidden_dropout_ratios=[0.2, 0.2],
                                        single_node_mode=False,
                                        balance_classes=False,
                                        force_load_balance=False,
                                        seed=23123,
                                        score_training_samples=0,
                                        score_validation_samples=0,
                                        stopping_rounds=0)
    print('Training...')
    dl_cross.train(x=data.columns[1:-1],
                   y="Dive",
                   training_frame=data)

    # Save model
    print('Saving...')
    h2o.save_model(model=dl_cross, path="../Data/Reduced/H2O_ACC_XVal_Models/", force=True)

# Close session
h2o.shutdown()
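Not an answer, but when a job hangs with no error, the server-side JVM logs usually tell you whether it's GC thrash or an out-of-memory situation. Both calls below are standard h2o client functions:

import h2o

# Check how much memory the node actually has free right now
h2o.cluster().show_status()

# Pull the full server-side logs; look for long GC pauses or OutOfMemoryError
h2o.download_all_logs(dirname="h2o_logs")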
setting metalearner_algorithm to gbm or xgboost, or fine-tuning GLM parameters.
Hello, I was wondering if subsetting via the result of h2o.which could be supported? I think that workflow is fairly common in base R, so that you can identify row indices once, then re-use those for subsequent subsets. Currently h2o.which returns a frame, which can't be used to subset. You can convert that result to a vector, but that is very inefficient with big data.
I have a really specific use case for this that I posted on stackoverflow: https://stackoverflow.com/q/69366490/9244371 but it errors out saying rectangle assignment is unimplemented (if I'm reading that right). Thanks!
Hello, I think I found a bug in the GLM implementation when running multiple GLM training processes in parallel. It is actually a regression, because it used to work fine for me (I wanted to update h2o from 3.32.1.2 to 3.36.0.3). I tried to find the version that introduced the issue, and it seems to have been 3.32.1.5. I have a Java application that allows me to reproduce the issue (I can share it if that helps); in general it trains multiple GLM models in parallel. Importantly, it trains multinomial and binomial models concurrently; the exception does not occur for just binomial or just multinomial models. From time to time, such an exception is thrown:
Exception in thread "main" java.util.concurrent.ExecutionException: DistributedException from /192.168.178.163:54321: 'Index 1 out of bounds for length 1', caused by java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at src.main.java.GLMTest.main(GLMTest.java:46)
Caused by: DistributedException from /192.168.178.163:54321: 'Index 1 out of bounds for length 1', caused by java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
at water.MRTask.getResult(MRTask.java:654)
at water.MRTask.getResult(MRTask.java:664)
at water.MRTask.doAll(MRTask.java:524)
at water.MRTask.doAll(MRTask.java:476)
at hex.glm.GLM$GLMGradientSolver.getGradient(GLM.java:3367)
at hex.glm.ComputationState$2.getGradient(ComputationState.java:563)
at hex.glm.ComputationState$2.getObjective(ComputationState.java:573)
at hex.optimization.OptimizationUtils$SimpleBacktrackingLS.<init>(OptimizationUtils.java:74)
at hex.glm.GLM$GLMDriver.fitIRLSM_multinomial(GLM.java:1598)
at hex.glm.GLM$GLMDriver.fitModel(GLM.java:2028)
at hex.glm.GLM$GLMDriver.computeSubmodel(GLM.java:2488)
at hex.glm.GLM$GLMDriver.doCompute(GLM.java:2626)
at hex.glm.GLM$GLMDriver.computeImpl(GLM.java:2523)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:246)
at hex.glm.GLM$GLMDriver.compute2(GLM.java:1160)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1637)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
at hex.glm.GLMTask$GLMMultinomialGradientBaseTask.computeGradientMultipliers(GLMTask.java:1036)
at hex.glm.GLMTask$GLMMultinomialGradientTask.calMultipliersNGradients(GLMTask.java:1137)
at hex.glm.GLMTask$GLMMultinomialGradientBaseTask.map(GLMTask.java:1072)
at water.MRTask.compute2(MRTask.java:820)
at water.H2O$H2OCountedCompleter.compute1(H2O.java:1640)
at hex.glm.GLMTask$GLMMultinomialGradientTask$Icer.compute1(GLMTask$GLMMultinomialGradientTask$Icer.java)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1636)
... 5 more
Looking at the diff from 3.32.1.4 to 3.32.1.5 (https://github.com/h2oai/h2o-3/compare/jenkins-3.32.1.4...jenkins-3.32.1.5), there have been changes in the GLM class.
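For anyone trying to reproduce without the Java app, here is a rough Python analogue of the scenario (the dataset path and response columns are placeholders; the point is concurrent multinomial + binomial GLM training against one cluster):

import threading
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")            # placeholder dataset
train["multi_y"] = train["multi_y"].asfactor()  # >2 levels -> multinomial
train["bin_y"] = train["bin_y"].asfactor()      # 2 levels  -> binomial
predictors = [c for c in train.columns if c not in ("multi_y", "bin_y")]

def fit(family, y):
    glm = H2OGeneralizedLinearEstimator(family=family)
    glm.train(x=predictors, y=y, training_frame=train)

threads = [threading.Thread(target=fit, args=("multinomial", "multi_y")),
           threading.Thread(target=fit, args=("binomial", "bin_y"))]
for t in threads:
    t.start()
for t in threads:
    t.join()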
I am trying to use XGBoost with Sparkling Water, but when I call the fit method, I get the following error message. Other models work fine. Is XGBoost no longer supported?
My pyspark version is 3.2.1 and Sparkling Water is 3.36.0.3-1-3.2.
from pysparkling.ml import H2OXGBoostClassifier
estimator = H2OXGBoostClassifier(labelCol = "CAPSULE")
model = estimator.fit(trainingDF)
Py4JJavaError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14728/2631432188.py in <module>
1 from pysparkling.ml import H2OXGBoostClassifier
2 estimator = H2OXGBoostClassifier(labelCol = "CAPSULE")
----> 3 model = estimator.fit(trainingDF)
~\anaconda3\lib\site-packages\pyspark\ml\base.py in fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise TypeError("Params must be either a param map or a list/tuple of param maps, "
~\anaconda3\lib\site-packages\pyspark\ml\wrapper.py in _fit(self, dataset)
333
334 def _fit(self, dataset):
--> 335 java_model = self._fit_java(dataset)
336 model = self._create_model(java_model)
337 return self._copyValues(model)
~\anaconda3\lib\site-packages\pyspark\ml\wrapper.py in _fit_java(self, dataset)
330 """
331 self._transfer_params_to_java()
--> 332 return self._java_obj.fit(dataset._jdf)
333
334 def _fit(self, dataset):
~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
~\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
--> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o2851.fit.
: ai.h2o.sparkling.backend.exceptions.RestApiCommunicationException: H2O node http://192.168.0.218:54325 responded with
Status code: 404 : Not Found
Server error: {"__meta":{"schema_version":3,"schema_name":"H2OErrorV3","schema_type":"H2OError"},"timestamp":1647850376319,"error_url":"POST /3/ModelBuilders/xgboost","msg":"\n\nERROR MESSAGE:\n\nPOST /3/ModelBuilders/xgboost not found\n\n","dev_msg":"\n\nERROR MESSAGE:\n\nPOST /3/ModelBuilders/xgboost not found\n\n","http_status":404,"values":{},"exception_type":"water.exceptions.H2ONotFoundArgumentException","exception_msg":"\n\nERROR MESSAGE:\n\nPOST /3/ModelBuilders/xgboost not found\n\n","stacktrace":["water.exceptions.H2ONotFoundArgumentException: POST /3/ModelBuilders/xgboost not found"," water.api.RequestServer.response404(RequestServer.java:743)"," water.api.RequestServer.serve(RequestServer.java:467)"," water.api.RequestServer.doGeneric(RequestServer.java:301)"," water.api.RequestServer.doPost(RequestServer.java:227)"," javax.servlet.http.HttpServlet.service(HttpServlet.java:523)"," javax.servlet.http.HttpServlet.service(HttpServlet.java:590)"," ai.h2o.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)"," ai.h2o.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535)"," ai.h2o.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)"," ai.h2o.o
h2o==3.32.1.3
340 if self.model:
--> 341 predictions = self.model.predict(frame)
342 logger.debug(f"Prediction df columns: {predictions.columns}")
343 result = frame.cbind(predictions).as_data_frame().drop('predict', axis=1)
/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/model/model_base.py in predict(self, test_data, custom_metric, custom_metric_func)
233 eval_func_ref = h2o.upload_custom_metric(custom_metric)
234 if not isinstance(test_data, h2o.H2OFrame): raise ValueError("test_data must be an instance of H2OFrame")
--> 235 j = H2OJob(h2o.api("POST /4/Predictions/models/%s/frames/%s" % (self.model_id, test_data.frame_id), data = {'custom_metric_func': custom_metric_func}),
236 self._model_json["algo"] + " prediction")
237 j.poll()
/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
105 # type checks are performed in H2OConnection class
106 _check_connection()
--> 107 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
108
109
/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
476 save_to = save_to(resp)
477 self._log_end_transaction(start_time, resp)
--> 478 return self._process_response(resp, save_to)
479
480 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:
/local_disk0/pythonVirtualEnvDirs/virtualEnv-bb6018b9-bae7-4520-a3b8-ee2f28e1c8c3/lib/python3.7/site-packages/h2o/backend/connection.py in _process_response(response, save_to)
825 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
826 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 827 raise H2OResponseError(data)
828
829 # Server errors (notably 500 = "Server Error")
H2OResponseError: Server error water.exceptions.H2OKeyNotFoundArgumentException:
Error: Object 'Generic_model_python_1648692368835_1' not found in function: predict for argument: model
Request: POST /4/Predictions/models/Generic_model_python_1648692368835_1/frames/py_5_sid_8e84
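That error means the cluster no longer holds a model under that key (e.g. the cluster restarted since the model was loaded). A quick way to check, and to re-load the model if needed (paths are placeholders):

import h2o

# List every key currently held by the cluster; is the model id there?
print(h2o.ls())

# If not, re-load it before predicting, from a saved binary model or a MOJO
model = h2o.load_model("path/to/saved_model")   # placeholder path
# model = h2o.upload_mojo("path/to/model.zip")  # or from a MOJO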
Hi
How do I force stop an h2o cluster?
I have a local h2o cluster still running after adding the following code to my script:
if h2o.cluster():
    h2o.cluster().shutdown(prompt=False)
before calling h2o.init(). I can see that my script is using an old cluster instead of stopping it and creating a new one:
H2O_cluster_uptime: 56 mins 17 secs
How can I stop the old cluster?
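As far as I know, h2o.cluster() returns None until the current Python session has actually connected, so that guard never fires in a fresh script. A sketch of explicitly attaching to the old cluster and shutting it down (default port assumed):

import h2o

# Attach to the already-running local cluster, then shut it down
h2o.connect(ip="localhost", port=54321)
h2o.cluster().shutdown(prompt=False)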
I want to ask a question about h2o AutoML. The code is below:
H2OAutoML(max_models=2, seed=50, max_runtime_secs=300, nfolds=5, stopping_metric='misclassification', stopping_tolerance=0.01, stopping_rounds=3, verbosity="warn", include_algos = ["GLM", "DeepLearning", "DRF","GBM"])
I wonder how H2OAutoML selects the first model among "GLM", "DeepLearning", "DRF", and "GBM" to train. Is there any mechanism? And when it finishes training the first model, how does it select the second one to train?
What is the recommended approach to dynamically determining the mapping of cross-validation fold to cross-validation model? I've noticed that the mapping varies by version of h2o.
It's pretty slow, but currently I use h2o.cross_validation_predictions. This has a non-zero prediction for holdout records and zero if the record was in the training set of that cross-validation model. I can then line that up with my definition of the fold column to create a mapping.
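In Python terms, that trick looks roughly like this (a sketch; it assumes keep_cross_validation_predictions=True and a numeric 'predict' column following the zero-vs-non-zero convention described above):

# One frame per cv model, each the full height of the training frame
cv_preds = model.cross_validation_predictions()

for i, frame in enumerate(cv_preds):
    # Rows with a non-zero prediction were in cv model i's holdout fold
    n_holdout = (frame["predict"] != 0).sum()
    print("cv model %d: %s holdout rows" % (i, n_holdout))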
My concern is that it appears this functionality is deprecated and could be removed:
model_object@model$help$cross_validation_predictions
[1] "Cross-validation predictions, one per cv model (deprecated, use cross_validation_holdout_predictions_frame_id instead)"
That recommended frame is the aggregate holdout predictions, so I lose the ability to identify which cross-validation model each prediction came from.
Perhaps a simpler solution would be to add the mapping of fold-to-cross-validation model to the resulting model object. Would that be possible?
@hutch3232 good question and great suggestion; there should be a more explicit mapping
I like it - I had the same thought at one point but didn't get to it yet - can you please file a jira ticket so that the idea doesn't get lost? https://h2oai.atlassian.net/
Hello, I'm testing out h2o.gam and am coming across something unintuitive. Does it make sense for identical training models (same deviance) to have different predictions on the training frame when comparing with and without cross-validation?
train <- h2o.createFrame(cols = 5, seed = 22, seed_for_column_types = 55, factors = 3, missing_fraction = 0)
train$fold <- h2o.kfold_column(train, nfolds = 3, seed = 11)
train$response <- 50 +
  ifelse(train$C5 == "c4.l0", 10,
         ifelse(train$C5 == "c4.l1", 15,
                ifelse(train$C5 == "c4.l2", 20, 25))) +
  0.2 * train$C1 -
  0.05 * train$C2 -
  0.2 * train$C4 - 0.005 * train$C4^2 + 0.00005 * train$C4^3 +
  5 * h2o.runif(train)
params <- list(
x = c("C1", "C2", "C5"),
y = "response",
training_frame = train,
lambda = 0,
keep_gam_cols = TRUE,
gam_columns = c("C4"),
scale = c(.05),
num_knots = c(5),
spline_orders = c(3)
)
# no cross validation, bs = 0 (default)
mod <- do.call(what = "h2o.gam", args = params)
h2o.residual_deviance(object = mod, train = TRUE)
# [1] 76080.37
h2o.predict(mod, train)
# predict
# 1 43.49629
# 2 61.16891
# 3 58.14821
# 4 49.06894
# 5 54.63423
# 6 33.10237
#
# [10000 rows x 1 column]
# cross validation, bs = 0 (default)
mod2 <- do.call(what = "h2o.gam", args = c(params, fold_column = "fold"))
h2o.residual_deviance(object = mod2, train = TRUE)
# [1] 76080.37
h2o.predict(mod2, train)
# predict
# 1 115.53379
# 2 71.36690
# 3 44.06435
# 4 90.39080
# 5 66.77768
# 6 104.90591
#
# [10000 rows x 1 column]
It might also be worth looking at the models' h2o.residual_analysis_plot. mod looks pretty normal, but mod2 shows very strange patterns in the residuals. I'm not including the plots here, but different values of bs also had inconsistencies and strange residuals.
Thanks for any help understanding what is happening!
Hi Paul:
I tried your code and was able to reproduce the problem. When you use a fold column in cross-validation or specify a weights column in your parameters, the practice is to remove the fold/weight column from your data frame before calling predict on it.
I removed the fold column in the prediction with mod2, but the prediction output is still not the same as from mod. We have a bug in our code and I have opened a JIRA: https://h2oai.atlassian.net/browse/PUBDEV-8681.
The bs parameter specifies the type of splines GAM uses, and hence it will build different models for different values.
Thank you,
Wendy