Georgios Kourogiorgas
@gkourogiorgas
Hi all. I have Python code that calls h2o.init(), loads a saved model, and creates a data frame to run predict every hour. Does each run replace the model and data frame in memory, or does it create a new record?
Zachary Raicik
@zakraicik

I am using h2o's AutoML tool. The h2o frame I am using has a few columns that contain categorical values. For example, the gender column contains male and female.

When I pass the frame to the AutoML tool, h2o automatically encodes these features. The encoding method varies depending on which model is being used.

Is it possible to access the encoded frame? If I call the frame, I see the version that does not have the encoding.

Charley Stran
@charleystran
This is what I do when I import
column_types = ["enum","enum","enum","enum","enum","enum","enum","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric","numeric"]
data_frame = h2o.import_file(data_path, col_types=column_types)
not sure if that helps you or not, but I have some categorical numerical data
Zachary Raicik
@zakraicik
Hey- thanks for the answer but it's not quite what I am looking for
Basically, I am looking for a way to see what encodings AutoML is making on its own. I want to match feature values to SHAP values after the models are done training.
Michal Kurka
@michalkurka

@zakraicik AutoML uses the default (AUTO) encoding for each model: for GBM/DRF that is "Enum" encoding (H2O's way to represent the splits), and for XGBoost and GLM it is internal 1-hot encoding

there should be no issue matching features to their shap values, can you please provide an example?

@valkyrias_gitlab did you also check the logs on the nodes? and the spark log? was an executor lost?
Michal Kurka
@michalkurka

When running h2o automl after setting sort_metric='auc' I am getting this error "Failed to find ModelMetrics for criterion: auc" has anyone had a similar error/issue?

@mjorgen1 yes, I've seen this one before - and we were never able to reproduce it or find out why it was happening - we would appreciate if you could provide any information - logs, possibly dataset (if publicly available)

Zachary Raicik
@zakraicik
@michalkurka I think the problem stems from the internal 1-hot encoding. It's my understanding that these splits are made on the fly, but the frame is not actually modified. However, when SHAP values are calculated, they are output using the columns that were created on the fly. This means that if I call the frame, I will only see one column for gender, but the SHAP output would contain two columns: gender.Male and gender.Female. As a result, you can't match feature names to SHAP values. I am away from my desk but can create an example when I'm back.
Zachary Raicik
@zakraicik
@michalkurka I attached a dummy example of the SHAP problem in h2o. Can you take a look and let me know if it's possible to match SHAP feature names to actual feature values? In my attached example, you'll see the frame columns don't match the SHAP output columns. Thanks!
Michal Kurka
@michalkurka
@zakraicik okay, I see the problem now - the issue is with XGBoost - it outputs the 1-hot encoded frame in the TreeSHAP function; we can absolutely fix that (add an option to return just the original features, not the expanded frame)
Zachary Raicik
@zakraicik
@michalkurka awesome, thanks! Is it possible to access the 1-hot encoded frame in the meantime? I know for my example, I can sum the SHAP values of the 1-hot encoded frame to recreate the original frame, but I am looking for something that generalizes no matter how many features get encoded.
Michal Kurka
@michalkurka

R has a (private, non-exported) function .getExpanded: https://github.com/h2oai/h2o-3/blob/master/h2o-r/h2o-package/R/frame.R#L312

(this is used only in our testing), you can still get to it though

there is no such thing in Python (to my knowledge)

IMHO the best solution is to just collapse the frame with shapley values to the original structure

in the output frame you will have columns like SEX.female, SEX.male, SEX.NA - you can just sum up the 3 columns (only one value will be non-zero) to get a SEX column
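The collapsing step suggested above could be sketched roughly as follows. This is a minimal illustration, not an H2O API: the function name collapse_contributions is hypothetical, it assumes the SHAP output has already been converted to a pandas DataFrame, and it assumes expanded columns are named "<feature>.<level>" as in the SEX.female example. The caller passes in the original feature names, so no list of expanded columns is needed.

```python
import pandas as pd


def collapse_contributions(contrib, original_columns):
    """Sum expanded SHAP columns (e.g. SEX.female, SEX.male, SEX.NA)
    back into one column per original feature.

    Hypothetical helper: assumes expanded columns are named
    "<feature>.<level>" and that `contrib` is a pandas DataFrame.
    """
    collapsed = {}
    used = set()
    # Match longer feature names first, so a feature named "a.b"
    # claims its columns before a feature named "a" does.
    for feature in sorted(original_columns, key=len, reverse=True):
        prefix = feature + "."
        members = [
            c for c in contrib.columns
            if c not in used and (c == feature or c.startswith(prefix))
        ]
        if members:
            collapsed[feature] = contrib[members].sum(axis=1)
            used.update(members)
    result = pd.DataFrame(collapsed, index=contrib.index)
    # Restore the order of the original features.
    result = result[[f for f in original_columns if f in result.columns]]
    # Pass through anything unmatched, such as a bias term column.
    for c in contrib.columns:
        if c not in used:
            result[c] = contrib[c]
    return result
```

With H2O, the input would presumably come from something like model.predict_contributions(frame).as_data_frame(), and original_columns from the predictor names used for training. Because columns are only summed, each row's total contribution is unchanged.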

Zachary Raicik
@zakraicik
Yeah. For context, I am trying to add this SHAP output to an automated workflow. The number of expanded columns could vary depending on function arguments.
Is there a way to access the list of columns that were expanded by the leader model, so I can tell the function which ones to collapse?
Michal Kurka
@michalkurka
I see, well - this could be addressed in the next H2O fix release - please make a jira here: https://0xdata.atlassian.net/projects/PUBDEV/issues
there is no really straightforward (no-coding) solution imho
Zachary Raicik
@zakraicik
Thanks. I will make a jira
Cheng WeI
@valkyrias_gitlab
@michalkurka Hi Michal, no Spark executor was lost, and the node log didn't mention any error. I re-ran the model with no CV and it's not giving me the warning anymore. I did 20 rounds of random search and got no warning either. It might be caused by CV.
Before, I was using 3-fold CV and the warning showed up at the 2nd CV fold for the 3rd model
Michal Kurka
@michalkurka

very interesting observation - I cannot explain this behavior

with GPUs, we are doing something that would explain this - but I assume you are not running on GPUs, correct?

Cheng WeI
@valkyrias_gitlab
No I am not using GPUs
With the same data and the same cluster size it's not really reproducible; it only happened a few times this week
Zachary Raicik
@zakraicik
@michalkurka Above (CondenseExpandedColumns) is a function you can use to condense the expanded columns in the SHAP output. It will allow a user to match feature values to SHAP values.
Michal Kurka
@michalkurka
@zakraicik awesome - very cool! thank you for sharing
Jordan Bentley
@jbentleyEG
Has anyone tried running Sparkling Water on GraalVM?
I have it running now, and so far so good
might move it into production soon
some of the other code we have running in Spark depends on it
Michal Kurka
@michalkurka
I am not aware of SW in particular but my colleague Pavel experimented with standalone H2O - I will ask him to share details here
@jbentleyEG do you see performance benefit in your deployment?
Michal Kurka
@michalkurka

I just did a quick test and saw similar performance (speed) from the latest GraalVM on most datasets compared to JDK8

There was one dataset where Graal did very poorly in terms of training time

Michal Kurka
@michalkurka
(community edition was tested)
Jordan Bentley
@jbentleyEG
@michalkurka I haven't benchmarked it, but I will at some point
right now I'm mostly concerned about stability
do you know what the dataset that performed poorly was? What kind of model were you building?
Michal Kurka
@michalkurka
Zachary Raicik
@zakraicik

@michalkurka Hey- I have a h2o frame called 'hf'. I use the command below to create my training and validation frames.

'train, valid = hf.split_frame(ratios = [0.8])'

However, if I use h2o.frames() I will not see the frame_id for train and valid. Instead, I can see the frame_id for the splitter used to create train and valid.

If I call train or valid (e.g. train.shape), I can then see the frames in h2o.frames(). Is there a way to use h2o.get_frame() to get either of these frames without having to call them first after creating them?

Zachary Raicik
@zakraicik
Here is a simple example of the problem I am trying to describe.
Honza Sterba
@honzasterba
it looks like a bug, but it's actually not - split_frame is lazily evaluated, so the frames are not created until they are actually needed
Zachary Raicik
@zakraicik
@honzasterba yeah, I don't think it's a bug. I'm more just asking if there is a way to disable the lazy evaluation, so I can store the frame IDs in one place and access them in another as long as the instance remains connected.
Simon Schmid
@SimonSchmid
Hello,
I have seen that there is an SVD implementation of h2o available but I cannot find documentation for it. It is not even listed on this site http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html. What's the reason for this?
angela0xdata
@angela0xdata
@SimonSchmid SVD is still experimental, and we do not have regular tests for it. You cannot configure it in Flow, but you can try it out using R or Python:
Honza Sterba
@honzasterba
@zakraicik the best way would be to use the destination_frames arg of hf.split_frame(); this way you can set the IDs at creation time
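A rough sketch of that suggestion, assuming `frame` is an h2o.H2OFrame and a cluster is already running. The helper name split_with_ids and the default IDs "train_split" and "valid_split" are hypothetical, chosen only for illustration:

```python
def split_with_ids(frame, ratio=0.8, train_id="train_split",
                   valid_id="valid_split", seed=None):
    """Split a frame and assign both resulting frame IDs up front.

    `frame` is expected to be an h2o.H2OFrame. The default IDs are
    made-up examples; any IDs not already in use should work.
    """
    train, valid = frame.split_frame(
        ratios=[ratio],
        destination_frames=[train_id, valid_id],  # IDs set at creation time
        seed=seed,
    )
    return train, valid
```

After train, valid = split_with_ids(hf), other code connected to the same instance should be able to fetch the frames with h2o.get_frame("train_split") and h2o.get_frame("valid_split") without touching train or valid first.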
razou
@razou
Hi, I wanted to know the difference between these two parameters: "balance_classes" and "weights_column". If there is a difference, which one handles class imbalance better? Thanks