razou
@razou
This creates an H2O context. Where and when are you calling the H2O XGBoost method?
Jordan Bentley
@jbentleyEG
I think I figured it out: sparkConfigIn[["spark.ext.h2o.client.extra"]] <- "max_mem_size=\"20G\""
XGB gets called from Scala
I pass the H2O context into my Scala library with sparklyr
my script is running now with the max_mem_size set, I'll let you know what happens
it takes ~5-10 minutes to get to the XGB training
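For context, the hand-off to Scala looks roughly like this with sparklyr's invoke API (a sketch with placeholder class and method names, not my actual library):

```
library(sparklyr)

# sc is the spark_connection; train_tbl is a Spark table registered in R
result <- invoke_static(
  sc,
  "com.example.XGBTrainer",    # placeholder: Scala object exposing the training entry point
  "trainModel",                # placeholder: method that runs H2O XGBoost on the cluster
  spark_dataframe(train_tbl)   # hand over the underlying Spark DataFrame reference
)
```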
Jordan Bentley
@jbentleyEG
it doesn't seem to like that option; H2O won't even start
razou
@razou
OK, can you provide a code snippet?
Jordan Bentley
@jbentleyEG

I start spark with:

```
config[["spark.dynamicAllocation.enabled"]] <- FALSE
config[["spark.executor.extraJavaOptions"]] <- "-XX:+UseG1GC"
config[["spark.driver.extraJavaOptions"]] <- "-XX:+UseG1GC"

config[["spark.r.command"]] <- paste0(R.home("bin"), "/Rscript")
config[["spark.sql.crossJoin.enabled"]] <- TRUE
config[["spark.driver.maxResultSize"]] <- "10G"
config[["spark.executor.maxResultSize"]] <- "10G"
config[["spark.executor.cores"]] <- 5
config[["spark.driver.memory"]] <- "15G"
config[["spark.executor.memory"]] <- "15G"
config[["spark.network.timeout"]] <- "1024s"

...

sparkConfigIn[["spark.locality.wait"]] <- 3000
sparkConfigIn[["spark.task.maxFailures"]] <- 1
sparkConfigIn[["spark.scheduler.minRegisteredResourcesRatio"]] <- 1
sparkConfigIn[["spark.executor.heartbeatInterval"]] <- "10s"

sparkConfigIn[["spark.driver.extraJavaOptions"]] <- "-XX:MaxPermSize=384m"
sparkConfigIn[["spark.executor.extraJavaOptions"]] <- "-XX:MaxPermSize=384m"
sparkConfigIn[["spark.ext.h2o.client.extra"]] <- "max_mem_size=\"10G\""

...

spark <- spark_connect(master = sparkMasterURL,
                       spark_home = sparkHome, config = config)
h2o <- rsparkling::h2o_context(spark)
```

I misspoke earlier: it has 244GB of memory, not 32
it has 32 cores
Divya Mereddy
@DivyaMereddy007
@jbentleyEG, thanks for the update on the issue. I have not been able to resolve it yet; I will post here if I make any progress.
razou
@razou
@DivyaMereddy007 @jbentleyEG
does h2o_context(sc) show the amount of memory available, so you can check whether the config was taken into account?
Divya Mereddy
@DivyaMereddy007
[image attached: image.png]
[image attached: image.png]
it just shows this
Jordan Bentley
@jbentleyEG
[image attached: Screen Shot 2019-12-09 at 11.17.37 AM.png]
I do have this screen for memory allocation.
I only have half the machine's memory allocated in Spark, but from running htop on the console I can see that all of it gets consumed about 10 minutes into the training
razou
@razou
I think you did not specify the total number of executors to use (spark.executor.instances); since dynamic allocation is set to false, you need it in order to use the maximum of the available resources
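For example, in the sparklyr config that would look something like this (values are placeholders; adjust to your cluster):

```
# With dynamic allocation disabled, pin down the executors explicitly
config[["spark.dynamicAllocation.enabled"]] <- FALSE
config[["spark.executor.instances"]] <- 5      # example value: total executors to request
config[["spark.executor.cores"]] <- 5
config[["spark.executor.memory"]] <- "15G"
```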
Jordan Bentley
@jbentleyEG
I'll try setting that, although doesn't it auto-detect if it isn't set?
razou
@razou

Here is an example of the settings I use with Spark and H2O; it works fine.

In this example I'm using worker (master) nodes with 48 CPUs and 384GB of memory:

```
#!/usr/bin/env bash

NB_CPU_PER_NODE=48
NODE_MEMORY=384
YARN_NODEMANAGER_RESOURCE_MEMORY=378   # NODE_MEMORY minus ~6 GB (or more) reserved for the OS, etc.
NB_WORKERS=10                          # total number of worker nodes in the cluster (example value)

NB_CPU_PER_EXECUTOR=5                  # number of CPUs per executor
NB_EXECUTOR_PER_NODE=$(( (NB_CPU_PER_NODE - 1) / NB_CPU_PER_EXECUTOR ))
NB_EXECUTORS_MAX=$(( NB_EXECUTOR_PER_NODE * NB_WORKERS - 1 ))   # -1 for the application master on one node
EXECUTOR_MEMORY_PLUS_OVERHEAD_MEMORY=$(( YARN_NODEMANAGER_RESOURCE_MEMORY / NB_EXECUTOR_PER_NODE ))
EXECUTOR_MEM_OVERHEAD=$(( EXECUTOR_MEMORY_PLUS_OVERHEAD_MEMORY * 15 / 100 ))
EXECUTOR_MEMORY=$(( EXECUTOR_MEMORY_PLUS_OVERHEAD_MEMORY - EXECUTOR_MEM_OVERHEAD ))

DRIVER_MEMORY=${EXECUTOR_MEMORY}g
DRIVER_MEM_OVERHEAD=${EXECUTOR_MEM_OVERHEAD}
DRIVER_CORE=$NB_CPU_PER_EXECUTOR

printf "\n---------------------------------\n"
echo "Settings"
echo "---------------------------------"

echo "NB_EXECUTORS_MAX=$NB_EXECUTORS_MAX"
echo "NB_CPU_PER_EXECUTOR=$NB_CPU_PER_EXECUTOR"
echo "NB_EXECUTOR_PER_NODE=$NB_EXECUTOR_PER_NODE"
echo "EXECUTOR_MEMORY_PLUS_OVERHEAD_MEMORY=${EXECUTOR_MEMORY_PLUS_OVERHEAD_MEMORY}g"
echo "EXECUTOR_MEMORY=${EXECUTOR_MEMORY}g"
echo "EXECUTOR_MEM_OVERHEAD=${EXECUTOR_MEM_OVERHEAD}g"
echo "DRIVER_MEMORY=${DRIVER_MEMORY}"
echo "DRIVER_MEM_OVERHEAD=${DRIVER_MEM_OVERHEAD}g"
echo "DRIVER_CORE=$DRIVER_CORE"
```

The script above shows how to get the right number of executors based on the executor memory and executor CPUs.
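To feed those computed values into a sparklyr config, a rough sketch (using the numbers the script would print for 10 worker nodes; adjust to your cluster):

```
conf <- sparklyr::spark_config()
conf$spark.dynamicAllocation.enabled <- "false"
conf$spark.executor.instances <- 89           # NB_EXECUTORS_MAX (example value)
conf$spark.executor.cores <- 5                # NB_CPU_PER_EXECUTOR
conf$spark.executor.memory <- "36g"           # EXECUTOR_MEMORY (example value)
conf$spark.executor.memoryOverhead <- "6g"    # EXECUTOR_MEM_OVERHEAD (Spark 2.3+ property name)
conf$spark.driver.memory <- "36g"             # DRIVER_MEMORY
conf$spark.driver.cores <- 5                  # DRIVER_CORE
```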

Jordan Bentley
@jbentleyEG
spark.executor.instances is only a setting for YARN; I am running in standalone mode (no EMR)
I'll try adjusting memory based on your script though
razou
@razou

spark.executor.instances is only a setting for Yarn, I am running in standalone (no EMR)

That is a Spark property, like driver_memory; it is not tied to YARN/EMR

Divya Mereddy
@DivyaMereddy007
@razou, I set the number of executors based on AWS best practices, but I was still not able to run the model
razou
@razou
Maybe trying a GBM model (less expensive) instead of XGBoost, with sample_rate in [0.5, ..., 0.8], could help
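Something like this, as a rough sketch (the frame and column names are placeholders, not your actual data):

```
# Hypothetical GBM run with row subsampling, as a cheaper alternative to XGBoost
gbm_model <- h2o.gbm(
  x = predictors,           # placeholder: character vector of predictor column names
  y = "response",           # placeholder: response column name
  training_frame = train,   # placeholder: an H2OFrame
  ntrees = 100,
  sample_rate = 0.7         # row sample rate in the suggested 0.5-0.8 range
)
```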
Divya Mereddy
@DivyaMereddy007
I am running a CoxPH model, not GBM. When I tried running it with a small dataset it worked,
but with the full dataset it's not working
razou
@razou
@DivyaMereddy007 @jbentleyEG it would be good to put the whole code, up to the model usage, together with the configuration in a gist or equivalent, so we can get a better view and find a solution together
And also: the cluster size, i.e. number of machines, their memory and CPUs
Divya Mereddy
@DivyaMereddy007

below is my sample code:

```
library(survival)
library(MASS)
library(h2o)
library(rsparkling)
library(sparklyr)
library(dplyr)

Sys.setenv(SPARK_HOME = "/usr/lib/spark")
conf <- spark_config()

conf$spark.executor.instances <- 171
conf$spark.yarn.executor.memoryOverhead <- 2048
conf$spark.executor.memory <- "18g"
conf$spark.executor.cores <- 5

conf$spark.yarn.driver.memoryOverhead <- 39936
conf$spark.driver.memory <- "57.6g"   # note: fractional sizes may not be accepted by the JVM; "57g" is safer
conf$spark.driver.cores <- 5

conf$'sparklyr.shell.executor-memory' <- "32g"
conf$'sparklyr.shell.driver-memory' <- "32g"
conf$spark.yarn.am.memory <- "32g"
conf$spark.dynamicAllocation.enabled <- "false"

sc <- spark_connect(master = "yarn-client", version = "2.4.3", config = conf)
mydata <- spark_read_parquet(sc, path = "s3aparquetfile")

# Transform data
mydata2 <- mydata
data_modelt_Full_Sample <- mydata %>% mutate(arr = explode(array_repeat(column_1, column1)))

# H2O setup
h2o_context(sc)

Full_Data <- as_h2o_frame(sc, data_modelt_Full_Sample)

# H2O: convert columns to factors
Full_Data$column_2 <- as.factor(Full_Data$column_2)
Full_Data$column_3 <- as.factor(Full_Data$column_3)
# ... up to 270 columns

Full_Data$strata <- as.factor(Full_Data$strata)

# CoxPH H2O model
predictorsSt <- c("column_1", "column_2", "column_3")  # ... up to 270 columns

h2o_modelt_D1_p <- h2o.coxph(
  x = predictorsSt,
  event_column = "event_column",
  start_column = "start_column",
  stop_column = "stop_column",
  offset_column = "offset_column",
  ties = c("breslow"),
  stratify_by = c("strata"),
  interaction_pairs = list(
    c("X", "X1"),
    c("X", "X2"),
    c("X3", "X"),
    c("X3", "X2")
  ),
  training_frame = Full_Data
)

coefficients_Stage_1 <- h2o_modelt_D1_p@model$coefficients_table$coefficients
```

I have also tried different Spark configurations along with the one above
razou
@razou
this is not quite readable here :)
a gist or something shareable would be better
Divya Mereddy
@DivyaMereddy007
here is the R sample code. I was not able to include the exact code because of security constraints
Please let me know if you need any other details
razou
@razou
Thanks @DivyaMereddy007, this is enough. I'll look at it and let you know if I find something that could cause this issue
Jordan Bentley
@jbentleyEG
my training just completed successfully with updated settings:

```
config[["spark.driver.maxResultSize"]] <- "10G"
config[["spark.executor.maxResultSize"]] <- "10G"
config[["spark.executor.cores"]] <- 5
config[["spark.executor.instances"]] <- 5
config[["spark.driver.cores"]] <- 5
config[["spark.driver.memory"]] <- "54G"
config[["spark.executor.memory"]] <- "54G"
config[["spark.network.timeout"]] <- "1024s"
```
thanks for your help!
Hopefully the run finishing wasn't a fluke; I will let you know
Jordan Bentley
@jbentleyEG
I spoke too soon, it doesn't seem to be working consistently
Jordan Bentley
@jbentleyEG
From the spark log: terminate called after throwing an instance of 'std::bad_alloc'
Jordan Bentley
@jbentleyEG
I was able to train a comparable deep learning model
razou
@razou
Cool, happy for you ;)
razou
@razou

Hi @DivyaMereddy007, I looked at your code, but it looks like you are using only R instead of RSparkling / Sparkling Water. The second thing I noticed is that you have a lot of dimensions (sum of the number of levels of all categorical variables + number of numerical variables).
I didn't see where you used H2O, RSparkling, or sparklyr ...

(Doc: http://docs.h2o.ai/sparkling-water/2.2/latest-stable/doc/rsparkling.html)

I think your problem is that you are using plain R (not very scalable), which always tries to load all the data into memory, plus the fact that you have a lot of categorical features, which increases the dimensionality of your problem
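To illustrate the dimensionality point in plain R (a sketch on an ordinary data.frame named df, not on the H2OFrame itself):

```
# Rough count of model dimensions:
# one per numeric column, plus one per level of each factor column
n_dims <- sum(vapply(df, function(col) {
  if (is.factor(col)) nlevels(col) else 1L
}, integer(1)))
n_dims
```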

Divya Mereddy
@DivyaMereddy007
Hi @razou, I apologize for the inconvenience; I uploaded the wrong file. Please find the updated one here.
Jordan Bentley
@jbentleyEG
I tried going back to XGBoost and significantly reducing my data size, to about 10%; it still wouldn't build, and it managed to fill my machine's memory