Hi, we are having performance problems using LightGBMClassifier on pyspark running on AWS EMR. We have around 100 million rows with 19 features.
Running native LightGBM on an 8-core EC2 instance, training on this dataset finishes in around 15 minutes. Running the same dataset on an EMR cluster (we have tried many different node configurations) takes 30+ minutes.
model = LightGBMClassifier(...)
Looking at the resource utilisation, it looks like the executors are severely under-utilising CPU. We tried voting_parallel, but it did not seem to help. Any thoughts or tips?
@calvin-pietersen we recently noticed this too in benchmarking; for some datasets and parameters we can get better performance and higher CPU utilization by creating a single dataset per executor. This new (complex) mode has been implemented here:
Note it isn't always faster or better at using CPU; it only seems to help on particular datasets (especially those with many columns) and parameter combinations.
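For reference, the single-dataset-per-executor mode is toggled via a constructor parameter. A minimal sketch, assuming SynapseML's `useSingleDatasetMode` flag and hypothetical column names (it needs a running Spark session with the SynapseML package on the classpath, so it is configuration rather than something runnable standalone):

```python
# Sketch only: assumes a Spark session with the SynapseML package attached,
# e.g. spark-submit --packages com.microsoft.azure:synapseml_2.12:<version>
from synapse.ml.lightgbm import LightGBMClassifier

model = LightGBMClassifier(
    featuresCol="features",     # hypothetical assembled feature-vector column
    labelCol="label",           # hypothetical label column
    useSingleDatasetMode=True,  # build one LightGBM dataset per executor
)
# trained = model.fit(train_df)  # train_df: your prepared Spark DataFrame
```

Whether this helps depends on the dataset and parameters, as noted above, so it is worth benchmarking both settings on a sample of your 100M rows.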
What is the correct link for fetching the Microsoft Cognitive Services SDK?
This link (https://mmlspark.azureedge.net/maven/com/microsoft/cognitiveservices/speech/client-sdk/1.15.0/client-sdk-1.15.0.pom) says blob not found.
I tried using it from an sbt project:
val speechResolver = "Speech" at "https://mmlspark.azureedge.net/maven"

lazy val root = (project in file("."))
  .settings(
    name := "project",
    resolvers += speechResolver,
    libraryDependencies ++= Seq("com.microsoft.cognitiveservices.speech" % "client-sdk" % "1.15.0")
  )
It fails with the error:
[error] not found: https://mmlspark.azureedge.net/maven/com/microsoft/cognitiveservices/speech/client-sdk/1.15.0/client-sdk-1.15.0.pom
Kindly help me figure this out.