    Mark Hamilton
    Hey everyone, welcome to the MMLSpark Gitter, a place for less formal discussion about the library!
    Nishant Arora
    Hey everyone
    Can someone tell me roughly when the next version of MMLSpark will be released?
    Yuan Tang
    Hello everyone :-)
    Karl D. Gierach
    Greetings! Just curious about running VW and training with data in the traditional VW format. Is it possible to do this? If so, a code snippet would be appreciated.
    Markus Cozowicz
    @kgierach we haven't created an input format parser yet, but it's a good suggestion (maybe add it as a feature request on GitHub?). If you can drop the namespace information, the libsvm format should work.
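To illustrate the suggestion above (dropping namespaces so the data fits the libsvm format), here is a minimal Python sketch. It is not part of MMLSpark: the function name `vw_to_libsvm` and the `vocab` dictionary are hypothetical. It assumes every example has a label before the first `|`, ignores any weight or tag, and assigns each distinct feature name a 1-based integer index in order of first appearance, since libsvm needs numeric indices.

```python
from typing import Dict


def vw_to_libsvm(line: str, vocab: Dict[str, int]) -> str:
    """Convert one labeled VW-format example to a libsvm line, dropping namespaces."""
    head, *sections = line.strip().split("|")
    # Everything before the first '|' holds the label (plus optional weight/tag).
    # Only the label is kept.
    label = head.split()[0]
    features: Dict[int, float] = {}
    for section in sections:
        tokens = section.split()
        # A section that does not start with whitespace begins with a namespace name.
        # The namespace is dropped, as suggested in the chat.
        if section and not section[0].isspace():
            tokens = tokens[1:]
        for token in tokens:
            name, _, value = token.partition(":")
            # Map each feature name to a stable 1-based index (hypothetical scheme).
            index = vocab.setdefault(name, len(vocab) + 1)
            # A feature without an explicit value counts as 1.0, as in VW.
            features[index] = features.get(index, 0.0) + (float(value) if value else 1.0)
    # libsvm readers generally expect indices in ascending order.
    body = " ".join(f"{i}:{v:g}" for i, v in sorted(features.items()))
    return f"{label} {body}".rstrip()


vocab: Dict[str, int] = {}
print(vw_to_libsvm("1 |a x:0.5 y |b z:2", vocab))
print(vw_to_libsvm("-1 | y:3", vocab))
```

Because namespaces are discarded, two features with the same name in different namespaces collapse into one index; keeping them apart would need a different key, for example the namespace and name joined together.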
    Karl D. Gierach
    @eisber thanks for your reply. Glad to see this space is active. I will add that as a feature request. A benefit of building models directly from the VW native format is that coefficient names would be prefixed with the namespaces, which eases interpretation.
    Thai Thien
    Do you have a Spark Scala code example for LightGBM?
    Mark Hamilton
    Hey @ttpro1995, sorry for the delay, didn't realize so many folks were using this. You can check out our test code for examples in src/test/scala/
    Calvin Pietersen

    Hi, we are having performance problems using LightGBMClassifier on pyspark running on AWS EMR. We have around 100 million rows with 19 features.

    Training with native LGBM on an 8-core EC2 instance finishes in around 15 minutes on this dataset. Training on the same dataset on an EMR cluster (we have tried many different node configurations) takes 30+ minutes.

    model = LightGBMClassifier(
        learningRate=0.1,
        numIterations=100,
        numLeaves=360,
        parallelism='voting_parallel',
        categoricalSlotIndexes=list(range(len(columns))),
        timeout=12000.0,
    ).fit(data)

    Looking at the resource utilisation, it looks like the executors are severely underutilising CPU. We tried voting_parallel, though it did not seem to help. Any thoughts or tips?

    Calvin Pietersen
    Hi, I am trying to use MMLSpark with Spark 3.0.0.
    When I import the JARs, I get an Avro error while reading CSV.
    Can anyone please help?
    Hu Dong
    @calvin-pietersen I got a similar issue with LightGBMRanker. It only takes several minutes to train a model with the native LightGBM binary, but mmlspark takes 20 hours to train a model with the same parameters. From the Spark log, at each iteration the function call "LGBM_BoosterUpdateOneIter" takes 7~10 minutes to finish. I have no clue why it takes so long.
    Seems like this chat is empty, but wondering if anyone has had any experience with disabling feature shape checks for prediction? It doesn't look like we can adjust that parameter given the current state of the library.
    Sudarshan Raghunathan
    Hello @gabrielwomark could you please open an issue on GitHub for this?
    Anyone here? How would I install this on Spark 3?
    Ilya Matiach
    @ryanbbrownavanade you can install from master, which supports Spark 3
    for example see my reply on this issue:

    @calvin-pietersen we recently noticed this too in benchmarking. For some datasets and parameters we can get better performance and higher CPU utilization by creating a single dataset per executor. This new (complex) mode has been implemented here:


    Note that it isn't always faster and doesn't always use more CPU; it only seems to on particular datasets (especially those with many columns) and parameter combinations.

    Nafis Sadat
    Hey folks, is the latest JAR upload sort of wonky/broken? The JAR itself looks to be 331 bytes in size (https://mmlspark.azureedge.net/maven/com/microsoft/ml/spark/mmlspark/1.0.0-rc3-106-84f96e9a-SNAPSHOT/mmlspark-1.0.0-rc3-106-84f96e9a-SNAPSHOT.jar)
    I've got a bash script that scrapes the SVG from the GitHub page and pulls the latest version of the JAR file into my Spark cluster every time, and also checks the file size in case I'm pulling an empty page/file. And I just saw that the latest JAR is being sad. Az Pipelines looks to have built successfully though.
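The kind of sanity check described above can be sketched in Python as follows. This is an illustration only, not the bash script from the chat: the helper name `looks_like_jar` and the 1024-byte threshold are assumptions. A JAR is a ZIP archive, so besides the size check the sketch also asks the standard library whether the bytes parse as a ZIP file.

```python
import io
import zipfile

# Hypothetical threshold: anything smaller is treated as an error page, not a JAR.
MIN_JAR_BYTES = 1024


def looks_like_jar(data: bytes, min_bytes: int = MIN_JAR_BYTES) -> bool:
    """Return True if the downloaded bytes plausibly are a real JAR (ZIP) archive."""
    if len(data) < min_bytes:
        return False
    # zipfile.is_zipfile accepts a file-like object and checks for a valid ZIP structure.
    return zipfile.is_zipfile(io.BytesIO(data))


# A 331-byte response, like the one reported above, fails the size check.
print(looks_like_jar(b"x" * 331))
```

The bytes can come from any download step (curl, urllib, and so on); the check itself needs no network access.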
    Nithishkumar K R

    Hi everyone,
    What is the correct link for fetching Microsoft Cognitive Services?
    This link (https://mmlspark.azureedge.net/maven/com/microsoft/cognitiveservices/speech/client-sdk/1.15.0/client-sdk-1.15.0.pom) says blob not found.

    I tried using it through an sbt project.

    val speechResolver = "Speech" at "https://mmlspark.azureedge.net/maven"
    resolvers += speechResolver
    libraryDependencies += "com.microsoft.cognitiveservices.speech" % "client-sdk" % "1.15.0"
    name := "project"

    It fails with the error: [error] not found: https://mmlspark.azureedge.net/maven/com/microsoft/cognitiveservices/speech/client-sdk/1.15.0/client-sdk-1.15.0.pom

    Kindly help me figure it out.