    Mark Hamilton
    Hey everyone, welcome to the MMLSpark Gitter, a place for less formal discussion about the library!
    Nishant Arora
    Hey everyone
    Can someone tell me roughly when the next version of MMLSpark will be released?
    Yuan Tang
    Hello everyone :-)
    Karl D. Gierach
    Greetings! Just curious about running VW and training with data in the traditional VW format. Is it possible to do this? If so, a code snippet would be appreciated.
    Markus Cozowicz
    @kgierach we haven't created an input format parser yet, but it's a good suggestion (maybe add it as a feature request on GitHub?). If you can drop the namespace information, the libsvm format should work.
    Karl D. Gierach
    @eisber thanks for your reply. Glad to see this space is active. I will add that as a feature request. A benefit of building models directly from the VW native format is that coefficient names would be prefixed with the namespaces, which eases interpretation.
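As a rough illustration of the "drop the namespace information" workaround mentioned above, here is a minimal Python sketch that rewrites one VW-format line as a LibSVM line. The function name `vw_to_libsvm` and the `index_of` dictionary are hypothetical helpers, not part of MMLSpark or VW. The sketch assumes every line starts with a label, keeps only the first token before the first `|` (importance weights and tags are discarded), and assigns integer indices in order of first appearance. Once namespaces are dropped, features with the same name in different namespaces share one index.

```python
def vw_to_libsvm(line, index_of):
    """Convert one VW-format line to a LibSVM line, discarding namespace names.

    index_of maps feature name -> integer index and is updated in place,
    so the same dictionary must be reused for every line of a dataset.
    """
    head, *sections = line.strip().split("|")
    label = head.split()[0]  # first token before the first '|' is the label
    feats = {}
    for section in sections:
        tokens = section.split()
        # If the text right after '|' is not whitespace, the first token
        # is the namespace name (possibly with a scale), so skip it.
        if section[:1] not in ("", " ", "\t"):
            tokens = tokens[1:]
        for tok in tokens:
            name, _, value = tok.partition(":")
            idx = index_of.setdefault(name, len(index_of) + 1)  # indices start at 1
            feats[idx] = float(value) if value else 1.0  # bare feature means value 1
    body = " ".join(f"{i}:{v:g}" for i, v in sorted(feats.items()))
    return f"{label} {body}".strip()


index_of = {}
print(vw_to_libsvm("1 |user age:25 premium |item price:9.5", index_of))
# prints: 1 1:25 2:1 3:9.5
```

Keeping the `index_of` mapping around also gives a way to translate coefficient indices back to feature names afterwards, although the namespace prefix itself is lost.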
    Thai Thien
    Do you have a Spark Scala code example for LightGBM?
    Mark Hamilton
    Hey @ttpro1995, sorry for the delay, I didn't realize so many folks were using this. You can check out our test code for examples in src/test/scala/
    Calvin Pietersen

    Hi, we are having performance problems using LightGBMClassifier on pyspark running on AWS EMR. We have around 100 million rows with 19 features.

    Running native LightGBM on an 8-core EC2 instance with this dataset finishes in around 15 minutes. Running the same dataset on an EMR cluster (we have tried many different node configurations) takes 30+ minutes.

    model = LightGBMClassifier( learningRate=0.1, numIterations=100, numLeaves=360, parallelism='voting_parallel', categoricalSlotIndexes=list(range(len(columns))), timeout=12000.0).fit(data)

    Looking at the resource utilisation, it looks like the executors are severely underutilising CPU. We tried voting_parallel, but it did not seem to help. Any thoughts or tips?

    Calvin Pietersen
    Hi, I am trying to use MMLSpark with Spark 3.0.0.
    When I try to import the JARs, I get an Avro error while reading CSV.
    Can anyone please help?
    Hu Dong
    @calvin-pietersen I got a similar issue with LightGBMRanker. It only takes several minutes to train a model with the native LightGBM binary, but mmlspark takes 20 hours to train a model with the same parameters. From the Spark log, at each iteration the function call "LGBM_BoosterUpdateOneIter" takes 7 to 10 minutes to finish. I have no clue why it takes so long.
    Seems like this chat is empty, but I'm wondering if anyone has had any experience with disabling feature shape checks for prediction? It doesn't look like we can adjust that parameter given the current state of the library.
    Sudarshan Raghunathan
    Hello @gabrielwomark could you please open an issue on GitHub for this?
    Anyone here? How would I install this on Spark 3?
    Ilya Matiach
    @ryanbbrownavanade you can install from master, which supports spark 3
    for example see my reply on this issue:

    @calvin-pietersen we recently noticed this too in benchmarking. For some datasets and parameters we can get better performance and higher CPU utilization by creating a single dataset per executor. This new (complex) mode has been implemented here:


    Note that it isn't always faster and doesn't always use more CPU. It only seems to help on particular datasets (especially those with many columns) and parameter combinations.

    Nafis Sadat
    Hey folks, is the latest JAR upload sort of wonky/broken? The JAR itself looks to be 331 bytes in size (https://mmlspark.azureedge.net/maven/com/microsoft/ml/spark/mmlspark/1.0.0-rc3-106-84f96e9a-SNAPSHOT/mmlspark-1.0.0-rc3-106-84f96e9a-SNAPSHOT.jar)
    I have a bash script that scrapes the SVG from the GitHub page and pulls the latest version of the JAR file into my Spark cluster every time, and it also checks the file size to make sure I'm not pulling an empty page/file. I just saw that the latest JAR is being sad... Az Pipelines looks to have built successfully though.
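The size check described above can be sketched in Python as well. This is only an illustration under stated assumptions: the function name `looks_like_jar` and the `min_bytes` threshold are hypothetical and not taken from the script being discussed. Since a JAR is a ZIP archive, the sketch also asks the standard library whether the file is a readable ZIP, which would catch an HTML error page that happens to be large.

```python
import os
import zipfile


def looks_like_jar(path, min_bytes=1024):
    """Return True if the file is large enough and is a readable ZIP archive.

    min_bytes is an arbitrary illustrative threshold; a 331-byte download
    like the one reported above would be rejected by the size check alone.
    """
    if os.path.getsize(path) < min_bytes:
        return False
    # JAR files are ZIP archives, so a non-ZIP payload is not a usable JAR.
    return zipfile.is_zipfile(path)
```

A download script could call `looks_like_jar` on the fetched file and fall back to the previously cached JAR when it returns False.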
    Nithishkumar K R

    Hi everyone,
    For accessing Microsoft Cognitive Services, what is the correct link to fetch the SDK from?
    This link (https://mmlspark.azureedge.net/maven/com/microsoft/cognitiveservices/speech/client-sdk/1.15.0/client-sdk-1.15.0.pom) says blob not found.

    I tried using it through an sbt project.

    // build.sbt
    name := "project"
    val speechResolver = "Speech" at "https://mmlspark.azureedge.net/maven"
    resolvers += speechResolver
    libraryDependencies ++= Seq("com.microsoft.cognitiveservices.speech" % "client-sdk" % "1.15.0")

    It fails with the error: [error] not found: https://mmlspark.azureedge.net/maven/com/microsoft/cognitiveservices/speech/client-sdk/1.15.0/client-sdk-1.15.0.pom

    Kindly help me figure it out.

    Hey folks, has anyone tried to build mmlspark on their own desktop? Some unit tests do not pass for the LightGBM modules, such as com.microsoft.ml.spark.lightgbm.split1.VerifyLightGBMClassifier#"Verify LightGBM Classifier with validation dataset". I'm just wondering whether the unit tests are reproducible?
    Thai Thien
    Hi, do you know how to start PySpark behind a firewall?

    import os
    os.environ["SPARK_LOCAL_IP"] = "localhost"
    os.environ["JAVA_HOME"] = "/server/java/jdk1.8.0_192-x64"
    os.environ["HADOOP_USER_NAME"] = "rnd"
    os.environ["HADOOP_CONF_DIR"] = "/server/hadoop-"
    os.environ["HADOOP_HOME"] = "/server/hadoop-"
    os.environ["SPARK_HOME"] = "/data/xxxx/share/spark-3.0.1-bin-hadoop3.2/"

    import findspark
    findspark.init()  # locate the Spark installation via SPARK_HOME set above

    from pyspark.sql import SparkSession, SQLContext
    import pyspark.sql.functions as f

    spark = SparkSession.builder.appName("synap-notebook").master("local[10]")\
    .config("spark.ui.enabled", False)\
    .config("spark.driver.memory", "100g")\
    .config("spark.local.dir", "/data/xxxx/tmp/")\
    .config("spark.yarn.queue", "rnd")\
    .config("fs.defaultFS", "hdfs://nnp.v3.h2.xxxx:8020")\
    .getOrCreate()

    Exception: Java gateway process exited before sending its port number
    I think I set my proxy incorrectly, so it can't fetch the JAR packages.
    Does anyone have a working example?
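The "Java gateway process exited before sending its port number" exception is raised when the JVM process that PySpark launches exits before it reports back, so one cheap thing to rule out before looking at the proxy is an environment variable that points to a directory that does not exist. The following is a minimal sketch, not a fix for the proxy question itself. The helper name `missing_paths` is hypothetical, and the default list of variables simply mirrors the ones set in the snippet above.

```python
import os


def missing_paths(env, keys=("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR")):
    """Return {name: value} for variables that are unset or not an existing directory."""
    problems = {}
    for key in keys:
        value = env.get(key)
        if not value or not os.path.isdir(value):
            problems[key] = value
    return problems


# Example: inspect the current process environment before building the SparkSession.
for name, value in missing_paths(os.environ).items():
    print(f"{name} is unset or not a directory: {value!r}")
```

If this prints nothing, the paths at least exist, and the launch failure more likely comes from elsewhere, for example package resolution failing behind the firewall.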
    George Fei
    Hi everyone, which VW versions are currently supported by SynapseML?
    George Fei
    Another question: one of the limitations for VowpalWabbit on Spark is that it requires CentOS. Is it compatible with Rocky Linux?
    Olivier Daneau
    Hi everyone, we are trying to perform an offline install of SynapseML 0.9.5 on Spark and are hitting the error "'JavaPackage' object is not callable". It seems we are missing some sub-dependencies. Is this a known issue?
    Mark Hamilton
    @luoguohao https://microsoft.github.io/SynapseML/docs/reference/developer-readme/ this will help you set up a local build on your desktop!
    @ttpro1995, hmm, I'm not sure, though our development setup posted above can help you get started. Also, pip install pyspark can often get you 90% of the way there, but beware that it's not the full Spark backend.
    Jeffrey Dang
    Hi all, I'm trying to learn more about the project setup so I can contribute. As of now I'm running into issues with just the tests. As far as Azure is concerned, is there an offline mode for testing, or is Azure a hard requirement?