Haifeng Li
@haifengl
@lukasbraach It is a DEBUG level log message. You see it only if your log level is debug or lower.
@MZazvor_gitlab stderr = true
rikima
@rikima_twitter
Yesterday I noticed that SMILE is an awesome ML library implemented in Java/Scala, so I am planning to use it for our system, which has ML-based functionality.
I am studying the framework now; are there functions to get ROC/AUC metrics?
Can all classification algorithms output a score as well as the predicted label?
Haifeng Li
@haifengl
@rikima_twitter You can find answers at http://haifengl.github.io/. There are API docs too.
For AUC (and many other metrics), check out the smile.validation.metric package. The classification algorithms report posterior probabilities if they implement SoftClassifier. Some algorithms also have a score() method, which does not necessarily return probabilities.
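A minimal sketch of what that looks like, assuming a binary problem; x, y, testX, and testY are illustrative placeholders for your training/test arrays:

    import smile.classification.LogisticRegression;
    import smile.validation.metric.AUC;

    // x: training features (double[][]), y: training labels in {0, 1}
    LogisticRegression model = LogisticRegression.fit(x, y);

    // SoftClassifier.predict(x, posteriori) fills the posterior probability array.
    double[] prob = new double[testX.length];
    double[] posteriori = new double[2];
    for (int i = 0; i < testX.length; i++) {
        model.predict(testX[i], posteriori);
        prob[i] = posteriori[1];  // probability of the positive class
    }

    System.out.println("AUC = " + AUC.of(testY, prob));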
rikima
@rikima_twitter
@haifengl Thank you so much for your support. I will check the docs and the info you suggested.
Again, SMILE is an awesome ML library! It has a clean dataframe implementation and a classification framework that makes it easy to plug in other algorithms. So I am thinking of implementing missing algorithms like factorization machines, field-aware FM, NGBoost, and so on.
Haifeng Li
@haifengl
@rikima_twitter check out CrossValidation, Bootstrap, etc. in smile.validation. They calculate all the metrics automatically.
@rikima_twitter Looking forward to your contributions of new algorithms. Thanks a lot in advance!
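A hedged sketch of a 10-fold cross-validation run; the "class" column name and the data variable are illustrative, and it assumes the Formula/DataFrame overload with RandomForest.fit(Formula, DataFrame):

    import smile.classification.RandomForest;
    import smile.data.DataFrame;
    import smile.data.formula.Formula;
    import smile.validation.CrossValidation;

    // data: a DataFrame holding both the class column and the predictors.
    Formula formula = Formula.lhs("class");
    var cv = CrossValidation.classification(10, formula, data, RandomForest::fit);
    System.out.println(cv);  // summarizes accuracy and the other metrics across folds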
rikima
@rikima_twitter
Thanks!
I am looking for SparseArray.java now. I need a sparse vector implementation for example-based classifiers like SVM or SparseLogisticRegression. I could not find the smile.util package in the repository.
rikima
@rikima_twitter
In smile-math there is a smile.util package, and in there is SparseArray.java.
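A tiny sketch of using smile.util.SparseArray as a sparse vector, assuming the append/get accessors; the indices and values are illustrative:

    import smile.util.SparseArray;

    SparseArray vec = new SparseArray();
    vec.append(3, 1.0);    // nonzero entry at index 3
    vec.append(17, 2.5);   // nonzero entry at index 17

    double a = vec.get(3);    // 1.0
    double b = vec.get(10);   // indices that were never set read back as 0.0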
Ryan Bennett
@rwbennett

Hi, I have a question about the new(?) OLS.fit() method. If I try to use it to predict housing sale prices, like OLS.fit(Formula.lhs("SalePrice"), X_train_dataframe), it fails with "no response variable".

So it seems I can only use OLS if I pass it a dataframe that includes X and y together. However, if I use it like OLS.fit(Formula.lhs("SalePrice"), training_dataframe), where training_dataframe includes both the dependent and independent columns (X and y), then predict requires an array of the same size, including a column for the value I wish to predict.

However, doing that and passing, say, 0 for the y value results in wildly incorrect predictions, and changing the value affects the predictions. Is there no way to use OLS without commingling X and y?

Anoukh Ashley
@anoukh_ashley_twitter
Hi. Is there a version of smile-mkl that can be used with Scala 2.11? I can't seem to get sbt to download the dependency: libraryDependencies += "com.github.haifengl" %% "smile-mkl" % "2.6.0"
Haifeng Li
@haifengl
@rwbennett The training data frame must have both X and y, but prediction doesn't require y in the data frame. Make sure to use the latest version.
@anoukh_ashley_twitter smile-mkl is a pure Java library, so use a single %: libraryDependencies += "com.github.haifengl" % "smile-mkl" % "2.6.0"
Ryan Bennett
@rwbennett
@haifengl I'm using 2.6.0. My code (https://pastebin.com/FEUsXqpY) uses a dataframe with both X and y (21 columns), but when I try to predict using a 20-column double array, I get "java.lang.IllegalArgumentException: Invalid input vector size: 20, expected: 21". Shouldn't prediction work with a 20-column dataframe, given that I trained the model on a 21-column DF with Formula.lhs("SalePrices") ?
Haifeng Li
@haifengl
@rwbennett First of all, life is much easier if you read the data frame with Smile's API. Why go through Tablesaw? For prediction, it is natural to use a Tuple/DataFrame. If you have to use double[], make sure to include the bias (i.e., a 1) in your vector.
Ryan Bennett
@rwbennett
@haifengl , I'm using tablesaw because I'm following a somewhat poorly-explained Udemy course that uses tablesaw with smile. Thanks for the suggestion though; I'll look into using smile's dataframes directly instead. As for using the double array with a bias of 1, I assume you mean inserting a 1 at index 0 in the array? I'm not sure if the reason is that the array is assumed to be 1-indexed by smile, but at any rate, doing so gives a result that matches the example I'm following, so thank you very much for your help!
Haifeng Li
@haifengl
@rwbennett Smile is not 1-indexed. A linear model may or may not have a bias term. If you provide a tuple (e.g. dataframe.get(0)), Smile will create the proper vector automatically. This is especially important when the data frame has categorical variables.
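A minimal sketch of the formula-based workflow described above; the train/test DataFrame names and the "SalePrice" column are illustrative:

    import smile.data.DataFrame;
    import smile.data.formula.Formula;
    import smile.regression.LinearModel;
    import smile.regression.OLS;

    // The training frame holds both the response and the predictors.
    LinearModel model = OLS.fit(Formula.lhs("SalePrice"), train);

    // Predicting from a Tuple lets the formula drop the response column and
    // build the proper feature vector (intercept, categorical encoding) for you.
    double yhat = model.predict(test.get(0));

    // A raw double[] bypasses the formula, so it must already match the model's
    // internal layout, including the leading 1 for the intercept.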
wholfy
@wholfy
@haifengl Hi, I would appreciate your advice.
I have a large data array that needs to be partitioned into clusters, but I can't load it into RAM.
Is it possible to load the source array in parts and use GMeans so that the result is similar to clustering the full source array?
Haifeng Li
@haifengl
@wholfy no, GMeans needs all the data.
hmf
@hmf
I am trying to cluster a large number of instances (60000). Hierarchical clustering fails because n(n-1)/2 exceeds the maximum float array length. Which algorithm would be best to use? Do these avoid the O(N^2)-space distance matrix: CLARANS, DBSCAN, minimum entropy clustering?
Haifeng Li
@haifengl
@hmf These methods all work on large data.
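For example, a hedged sketch of running DBSCAN without materializing the full distance matrix; x is the double[][] of instances, and the minPts/radius values are illustrative and need tuning for your data:

    import smile.clustering.DBSCAN;

    // DBSCAN uses neighborhood queries rather than an n*n distance matrix.
    DBSCAN<double[]> clusters = DBSCAN.fit(x, 20, 0.5);
    int[] labels = clusters.y;  // cluster label per instance (noise points get a special outlier label)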
hmf
@hmf
@haifengl Thanks. Will try them on a 10k data-set.
lskowr
@lskowr
@haifengl Is Lanczos in Smile thread-safe (also on the native level)?
Murat Koptur
@mrtkp9993
Hello, I am a newbie to Smile. I need to perform PCA and clustering, but the examples on http://haifengl.github.io/clustering.html didn't work for me.
I wrote the following code but I'm getting a LAPACK GESDD error:

        CLARANS<double[]> clusters = PartitionClustering.run(20, () -> CLARANS.fit(x, new EuclideanDistance(), 6, 10));

        PCA pca = PCA.fit(x);
        pca.setProjection(2);
        double[][] y = pca.project(x);

        Canvas plot = ScatterPlot.of(y, clusters.y, '-').canvas();
x is a double[][].
I need to plot the PCA projection and label each point with a string.
Murat Koptur
@mrtkp9993
var clusters = GMeans.fit(x, 5);
I am getting "Index 0 out of bounds for length 0" for GMeans.
Darren Wilkinson
@darrenjw

Hi all. I'm having trouble getting started with basic linear algebra operations in Smile. I wonder if someone could help? In particular, I don't think I understand how symmetric matrices work.

val m2 = matrix(c(3.0,1.0),c(1.0,2.0)) // create a matrix, which is symmetric
m2.isSymmetric // returns false
m2.cholesky() // fails

If I create a symmetric matrix, isSymmetric nevertheless returns false, so naturally, cholesky fails. Is there something I need to do to tell Smile that the matrix is symmetric? Thanks,

Haifeng Li
@haifengl
@darrenjw Smile doesn't check whether the matrix is symmetric by comparing element values; that is too slow and also depends on the epsilon. If you know that your matrix is symmetric, you should use the SymmMatrix class.
Darren Wilkinson
@darrenjw
OK, thanks. My next question relates to QR decomposition. I may be doing something wrong, but the results are not what I would expect. For example, given something like:
val mat = matrix(c(3.0,3.5),c(2.0,2.0),c(0.0,1.0))
mat.qr().Q
returns a matrix with columns that are not orthonormal.
András Dippold
@adippold
Hi all. While upgrading my project to use the latest version of Smile, I noticed that the SVM (one-vs-one) no longer returns the probabilities associated with the predicted labels. I use the probability value to accept decisions from the trained SVM only if it is reasonably sure of its decision. Looking at the sources, I found that within KernelMachine a 'score' is computed and used for assigning the appropriate labels. My question is: how can I compute the probability value from the score?
András Dippold
@adippold
Found the answer: PlattScaling was moved out of the SVM class and is now available as a separate class.
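A hedged sketch of calibrating scores with the standalone class, assuming fit(double[] scores, int[] labels) and scale(double) are the available entry points; the scores, y, and score variables are illustrative:

    import smile.classification.PlattScaling;

    // scores: decision values from the kernel machine on a held-out calibration set
    // y:      the corresponding labels
    PlattScaling platt = PlattScaling.fit(scores, y);
    double p = platt.scale(score);  // posterior probability of the positive class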
Pierre Nodet
@pierrenodet
Hey @haifengl, just so you're aware, Apache Spark is moving away from fommil netlib to a newer implementation. You can follow the process in this pull request: apache/spark#32415. It could be interesting for Smile to follow the same path for better integration. Breeze has done the same.
I can open an issue btw if you want to.
Haifeng Li
@haifengl
@pierrenodet We moved away a while back. The netlib (and nd4j) modules are not used and will be removed.
Pierre Nodet
@pierrenodet
ok nice !
Adrian Le'Roy Devezin
@dri94
How do I create a sparse vector for a column in the dataframe? I have a dataframe of tweets, so I need all the words in the "headline" column of my dataframe to be converted into a sparse vector instead (e.g. 0, 1, 0... 1, 0, 0). I haven't been able to find clear documentation on how to do this. I see the vectorize method but am still not quite sure how to piece it all together.
Haifeng Li
@haifengl
@dri94 DataFrame is for dense tabular data. It is not designed for sparse data.
Adrian Le'Roy Devezin
@dri94
@haifengl do you have a recommendation on what I should do then?
Adrian Le'Roy Devezin
@dri94
nvm. I see there is a sparse dataset
Adrian Le'Roy Devezin
@dri94
The documentation says that to serialize a model to disk we can do write.xstream(model, file); however, this method is not available. How can I save my trained model to disk?
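One common workaround, assuming the trained model class implements java.io.Serializable (Smile models generally do), is plain Java serialization; the model variable and file name are illustrative:

    import java.io.*;

    // Save the trained model.
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
        out.writeObject(model);
    }

    // Load it back later.
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("model.ser"))) {
        Object restored = in.readObject();
    }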
orion2107
@orion2107
Hello everyone, I'm kind of new to the Smile RandomForest model implementation and usage.
I have an issue with an already-trained model: when I use the model for predictions (calling the model.predict API), I sometimes get a NullPointerException.
This happens only during a load test in which I call this method with 5-10 concurrent users.
Here is part of the log file I get:
java.lang.NullPointerException: null
at smile.data.formula.Formula$2.getDouble(Formula.java:358)
at smile.base.cart.OrdinalNode.predict(OrdinalNode.java:45)
at smile.classification.DecisionTree.predict(DecisionTree.java:361)
at smile.classification.RandomForest.predict(RandomForest.java:514)
Can anyone please shed some light on this if you have noticed this kind of behavior as well?
Thank you
Haifeng Li
@haifengl
@orion2107 This issue is fixed. You can build the master branch and try it with your code, or you can just load the model separately in each thread as a workaround.
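A sketch of the per-thread workaround mentioned above, assuming the model was saved with Java serialization; the "rf.ser" path and the tuple input row are illustrative:

    import java.io.FileInputStream;
    import java.io.ObjectInputStream;
    import smile.classification.RandomForest;

    // Each thread deserializes its own copy, so no model state is shared across threads.
    ThreadLocal<RandomForest> localModel = ThreadLocal.withInitial(() -> {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("rf.ser"))) {
            return (RandomForest) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    });

    // In the request handler:
    int label = localModel.get().predict(tuple);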
orion2107
@orion2107
@haifengl Thank you very much for your answer, I truly appreciate it. Can I ask in which Smile release the fix was made? I'm currently using "com.github.haifengl" %% "smile-scala" % "2.5.2" with Scala 2.13.1.
I'm asking in order to understand whether my release includes the fix.
Thank you very much