Dylan Kane
@dmkaner
@haifengl ?
Pierre Nodet
@pierrenodet
@dmkaner In the official documentation there is a button to choose between Scala, Java, or Kotlin for the code examples
Dylan Kane
@dmkaner
@pierrenodet Thanks, but I was looking for something with a little more context maybe? Those examples were pretty limited
Haifeng Li
@haifengl
@dmkaner check out unit tests
Dylan Kane
@dmkaner
@haifengl thanks for the response Haifeng. Where can I find these?
Lukas Braach
@lukasbraach
@haifengl One more question regarding Random Forest: does Smile's decision tree implementation adapt to the input field's scale of measure, e.g. categorical vs. numerical?
implisci
@implisci
Hello @haifengl Are you considering something like this https://github.com/gpu/JOCLSamples/tree/master/src/main/java/org/jocl/samples for algorithms that can benefit from GPU? Or is there something else (jcuda?)
Haifeng Li
@haifengl
@dmkaner it is in the same code base
@lukasbraach yes
@implisci we will leverage the GPU
Dylan Kane
@dmkaner
Anyone know why smile.io can't be found when using smile 2.5.3 with Maven in Java?
Also thanks for the response @haifengl
Dylan Kane
@dmkaner
^ The above question is no longer pertinent; if anyone else hits the same issue, just copy the smile.io code from the GitHub repository into some local classes.
Haifeng Li
@haifengl
@dmkaner smile.io is in its own package (smile-io)
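For anyone hitting the same Maven issue, here is a minimal sketch of reading a CSV via the smile.io package. It assumes the separate com.github.haifengl:smile-io artifact (matching your Smile version) is on the classpath; "data.csv" is just a placeholder path.

```java
import smile.data.DataFrame;
import smile.io.Read;

public class ReadCsvExample {
    public static void main(String[] args) throws Exception {
        DataFrame df = Read.csv("data.csv");   // parse a CSV into a Smile DataFrame
        System.out.println(df.summary());      // quick statistical summary
    }
}
```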
Nino
@weinino
Hi @haifengl
I have a situation:
1.) I've trained a random forest with a DataFrame object (target, feature 1, ... , feature 5).
2.) I would like to use RandomForest::predict with a new sample (in production) that I get as a double[5] {1,2,3,4,5}.
3.) I generate a Tuple t with schema (feature 1, ..., feature 5) and data [1,2,3,4,5].
4.) When I run predict(t), an array-out-of-bounds exception occurs. The problem is that it tries to access feature 5 at index 5. That was correct for the schema of the dataset, but not for the schema of the tuple t.
I would have expected predict to access the data for feature 5 from the tuple t at index 4.
Is there something I'm doing fundamentally wrong, or might there be an inconsistency? I couldn't figure it out myself, since every example I found in the documentation predicts on a tuple coming from the original DataFrame.
Thank you :-)
Nino
@weinino
Two solutions seem possible in my opinion:
1.) Bind the new schema (feature 1, ..., feature 5). I've tried this, but since "response != null" in the RF's formula, I get an NPE. So, is there any support for scoring samples without targets?
2.) I artificially change my sample to include a dummy label, so that the target has a value as well and the schema again corresponds to the training case.
Haifeng Li
@haifengl
#2 should work. Option #1 should work too with v2.5.3. Which version are you using?
Nino
@weinino
Ok, thank you!
I'll try #1 first, after updating to v2.5.3. At the moment I'm on v2.3.0
Haifeng Li
@haifengl
On v2.5.3, you don't need to bind the schema manually. Smile handles it automatically.
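A minimal sketch of the feature-only-tuple case discussed above, targeting the 2.5.x API and assuming Tuple.of(double[], StructType) for building the sample. The column names f1..f5 and target, and the use of the regression RandomForest (for brevity), are illustrative assumptions rather than Nino's actual setup.

```java
import smile.data.DataFrame;
import smile.data.Tuple;
import smile.data.formula.Formula;
import smile.data.type.DataTypes;
import smile.data.type.StructField;
import smile.data.type.StructType;
import smile.regression.RandomForest;

public class FeatureOnlyTuple {
    public static void main(String[] args) {
        // Toy training frame: five features plus a numeric target column.
        double[][] rows = {
            {1, 2, 3, 4, 5, 10},
            {2, 3, 4, 5, 6, 12},
            {3, 4, 5, 6, 7, 14},
            {4, 5, 6, 7, 8, 16},
            {5, 6, 7, 8, 9, 18},
            {6, 7, 8, 9, 10, 20}
        };
        DataFrame train = DataFrame.of(rows, "f1", "f2", "f3", "f4", "f5", "target");
        RandomForest model = RandomForest.fit(Formula.lhs("target"), train);

        // Production sample: a feature-only schema, no target column.
        StructType features = DataTypes.struct(
            new StructField("f1", DataTypes.DoubleType),
            new StructField("f2", DataTypes.DoubleType),
            new StructField("f3", DataTypes.DoubleType),
            new StructField("f4", DataTypes.DoubleType),
            new StructField("f5", DataTypes.DoubleType));
        Tuple sample = Tuple.of(new double[] {1, 2, 3, 4, 5}, features);

        // On 2.5.3 the formula maps the feature-only tuple without manual binding.
        System.out.println(model.predict(sample));
    }
}
```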
Nino
@weinino
@haifengl I've updated the version and now all looks fine. Thank you :-)
Lukas Braach
@lukasbraach
Hey @haifengl, my Random Forest model is working (mostly) as expected, thank you! I still have one question: on the first inference with the freshly trained model, Smile logs "The response variable Classification doesn't exist in the schema [...]". Should I pay attention to this message? How do I get rid of it?
Martin Zazvorka
@MZazvor_gitlab
Hi, is there a possibility to compute OLS without the additional statistics? We are using this in an application where the statistics have to be computed separately anyway, so it would mean a major performance improvement. I have read somewhere, in a comparison with Apache, that the OLS stats take a significant part of the computation time. Thanks for any hint. Martin
Haifeng Li
@haifengl
@lukasbraach It is a DEBUG level log message. You see it only if your log level is debug or lower.
@MZazvor_gitlab stderr = true
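Presumably this refers to the stderr flag of OLS.fit. A minimal sketch, assuming the OLS.fit(formula, data, method, stderr, recursive) overload of the 2.x API and that passing stderr = false skips the extra statistics; the column names and numbers below are made up.

```java
import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.regression.LinearModel;
import smile.regression.OLS;

public class OlsWithoutStats {
    public static void main(String[] args) {
        double[][] rows = {
            {1, 2, 5},
            {2, 3, 8},
            {3, 5, 12},
            {4, 7, 16},
            {5, 8, 19}
        };
        DataFrame data = DataFrame.of(rows, "x1", "x2", "y");

        // method = "qr"; stderr = false skips t-statistics etc.; recursive = false.
        LinearModel model = OLS.fit(Formula.lhs("y"), data, "qr", false, false);

        // Fitted values for the training frame.
        System.out.println(java.util.Arrays.toString(model.predict(data)));
    }
}
```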
rikima
@rikima_twitter
I noticed yesterday that SMILE is an awesome ML library implemented in Java/Scala, so I am planning to use it for our system, which has ML-based functionality.
I am studying the framework now; are there functions to get ROC/AUC metrics?
Can all classification algorithms output a score as well as the predicted label?
Haifeng Li
@haifengl
@rikima_twitter you can find answers at http://haifengl.github.io/. There are API docs too.
For AUC (and many other metrics), check out the smile.validation.metric package. The classification algorithms report posteriori probabilities if they are SoftClassifiers. Some algorithms also have a score() method, though its output is not necessarily a probability.
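A minimal sketch of the AUC route described above, assuming the smile.validation.metric.AUC class and a SoftClassifier (LogisticRegression here) whose predict(x, posteriori) fills in class probabilities; the toy data is made up.

```java
import smile.classification.LogisticRegression;
import smile.validation.metric.AUC;

public class AucExample {
    public static void main(String[] args) {
        // Toy 1-D data with slightly noisy labels.
        double[][] x = {{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}};
        int[] y =      { 0,   0,   0,   1,   0,   1,   1,   1 };

        LogisticRegression model = LogisticRegression.fit(x, y);

        // SoftClassifier.predict(x, posteriori) fills in the class probabilities.
        double[] posteriori = new double[2];
        double[] scores = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            model.predict(x[i], posteriori);
            scores[i] = posteriori[1];   // probability of the positive class
        }

        System.out.println("AUC = " + AUC.of(y, scores));
    }
}
```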
rikima
@rikima_twitter
@haifengl Thank you so much for your support. I will check the docs and the info you suggested.
Again, SMILE is an awesome ML library! A clean dataframe implementation and a classification framework that makes it easy to add other algorithms. So I am thinking of implementing missing algorithms such as factorization machines, field-aware FM, NGBoost, and so on.
Haifeng Li
@haifengl
@rikima_twitter check out CrossValidation, Bootstrap, etc. in smile.validation. They calculate all the metrics automatically.
@rikima_twitter Look forward to your contributions of new algorithms. Thanks a lot in advance!
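A minimal sketch of the CrossValidation route suggested above, assuming the classification(k, x, y, trainer) signature and the ClassificationValidations result type of recent 2.x releases; the toy data and the LogisticRegression trainer are placeholders.

```java
import smile.classification.LogisticRegression;
import smile.validation.ClassificationValidations;
import smile.validation.CrossValidation;

public class CvExample {
    public static void main(String[] args) {
        double[][] x = new double[40][1];
        int[] y = new int[40];
        for (int i = 0; i < 40; i++) {
            x[i][0] = i;
            y[i] = (i % 7 < 3) ? 0 : 1;   // toy labels
        }

        // 5-fold cross-validation; the result summarizes the metrics across folds.
        ClassificationValidations<LogisticRegression> cv =
            CrossValidation.classification(5, x, y, LogisticRegression::fit);
        System.out.println(cv);
    }
}
```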
rikima
@rikima_twitter
Thanks!
I am looking for SparseArray.java now. I need a sparse vector implementation for example-based classifiers like SVM or SparseLogisticRegression. I could not find the smile.util package in the repository.
rikima
@rikima_twitter
In smile-math there is a smile.util package, and in there is SparseArray.java.
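For reference, a minimal sketch of smile.util.SparseArray used as a sparse vector; the indices and values are arbitrary, and the method and Entry field names (set, get, i, x) are as I recall them from the 2.x source.

```java
import smile.util.SparseArray;

public class SparseVectorExample {
    public static void main(String[] args) {
        // Only the non-zero entries are stored.
        SparseArray vector = new SparseArray();
        vector.set(3, 1.5);       // index 3 -> 1.5
        vector.set(1024, 0.25);   // index 1024 -> 0.25

        System.out.println(vector.get(3));     // 1.5
        System.out.println(vector.get(7));     // 0.0, never set
        System.out.println(vector.size());     // 2 non-zero entries

        // Iterate over the stored entries.
        vector.forEach(e -> System.out.println(e.i + " -> " + e.x));
    }
}
```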
Ryan Bennett
@rwbennett

Hi, I have a question about the new(?) OLS.fit() method. If I try to use it to predict housing sale prices, like OLS.fit(Formula.lhs("SalePrice"), X_train_dataframe), it fails with "no response variable".

So it seems I can only use OLS if I pass it a dataframe that includes X and y together. However, if I use it like OLS.fit(Formula.lhs("SalePrice"), training_dataframe), which includes both the dependent and independent columns (X and y), then predict requires an array of the same size, including a column for the value I wish to predict.

However, doing that and passing, say, 0 for the y value results in wildly incorrect predictions, and changing the value affects the predictions. Is there not a way to use OLS without commingling X and y?

Anoukh Ashley
@anoukh_ashley_twitter
Hi. Is there a version of smile-mkl that can be used with Scala 2.11? I can't seem to get sbt to download the dependency: libraryDependencies += "com.github.haifengl" %% "smile-mkl" % "2.6.0"
Haifeng Li
@haifengl
@rwbennett the training data frame must have both X and y, but for prediction it doesn't require y in the data frame. Make sure to use the latest version.
@anoukh_ashley_twitter smile-mkl is a pure Java library. Do libraryDependencies += "com.github.haifengl" % "smile-mkl" % "2.6.0"
Ryan Bennett
@rwbennett
@haifengl I'm using 2.6.0. My code (https://pastebin.com/FEUsXqpY) uses a dataframe with both X and y (21 columns), but when I try to predict using a 20-column double array, I get "java.lang.IllegalArgumentException: Invalid input vector size: 20, expected: 21". Shouldn't prediction work with a 20-column dataframe, given that I trained the model on a 21-column DF with Formula.lhs("SalePrices") ?
Haifeng Li
@haifengl
@rwbennett First of all, life is much easier if you read the data frame with Smile's API. Why go through Tablesaw? For prediction, the native way is to use a Tuple/DataFrame. If you have to use double[], make sure to include the bias (i.e. a 1) in your vector.
Ryan Bennett
@rwbennett
@haifengl , I'm using tablesaw because I'm following a somewhat poorly-explained Udemy course that uses tablesaw with smile. Thanks for the suggestion though; I'll look into using smile's dataframes directly instead. As for using the double array with a bias of 1, I assume you mean inserting a 1 at index 0 in the array? I'm not sure if the reason is that the array is assumed to be 1-indexed by smile, but at any rate, doing so gives a result that matches the example I'm following, so thank you very much for your help!
Haifeng Li
@haifengl
@rwbennett Smile is not 1-indexed. A linear model may or may not have a bias term. If you provide a tuple (e.g. dataframe.get(0)), Smile will create the proper vector automatically. This is especially important when the data frame has categorical variables.
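A minimal sketch of the two prediction paths for an OLS model, against the 2.6 API; the column names and numbers are made up, and (following this thread) the bias 1 goes at the front of the raw double[].

```java
import smile.data.DataFrame;
import smile.data.Tuple;
import smile.data.formula.Formula;
import smile.regression.LinearModel;
import smile.regression.OLS;

public class OlsPredict {
    public static void main(String[] args) {
        // Toy training frame: two features plus the response "SalePrice".
        double[][] rows = {
            {1400, 2, 200000},
            {1600, 3, 230000},
            {1700, 3, 245000},
            {1875, 4, 260000},
            {2100, 4, 290000}
        };
        DataFrame train = DataFrame.of(rows, "sqft", "rooms", "SalePrice");
        LinearModel model = OLS.fit(Formula.lhs("SalePrice"), train);

        // Preferred: predict from a tuple; the formula builds the design
        // vector (bias, encoded categoricals, ...) automatically.
        Tuple first = train.get(0);
        System.out.println(model.predict(first));

        // Raw double[]: the caller supplies the design vector itself,
        // i.e. the bias term 1 followed by the feature values.
        System.out.println(model.predict(new double[] {1.0, 1400, 2}));
    }
}
```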
wholfy
@wholfy
@haifengl Hi, I would appreciate your advice.
I have a large data array that needs to be partitioned into clusters, but I can't load it into RAM.
Is it possible to load the source array in parts and use GMeans so that the result is similar to clustering the whole source array?
Haifeng Li
@haifengl
@wholfy no, GMeans needs all the data.
hmf
@hmf
I am trying to cluster a large number of instances (60000). Hierarchical clustering fails because n(n-1)/2 exceeds the maximum length of a float array. Which algorithm would be the best to use? Do these avoid the O(n^2)-space distance matrix: CLARANS, DBSCAN, minimum entropy clustering?
Haifeng Li
@haifengl
@hmf these methods all work on large data
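For example, a minimal DBSCAN sketch on toy data, assuming the DBSCAN.fit(data, minPts, radius) signature of the 2.x API; the parameter values here are arbitrary.

```java
import smile.clustering.DBSCAN;

public class DbscanExample {
    public static void main(String[] args) {
        // Two well separated toy blobs.
        double[][] data = new double[200][2];
        for (int i = 0; i < 100; i++) {
            data[i][0] = Math.random();
            data[i][1] = Math.random();
            data[100 + i][0] = 10 + Math.random();
            data[100 + i][1] = 10 + Math.random();
        }

        // minPts = 5 neighbours within radius 0.5; no pairwise distance
        // matrix is materialized, so this scales to large n.
        DBSCAN<double[]> model = DBSCAN.fit(data, 5, 0.5);
        System.out.println("clusters: " + model.k);
        System.out.println("label of first point: " + model.y[0]);
    }
}
```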
hmf
@hmf
@haifengl Thanks. Will try them on a 10k data-set.
lskowr
@lskowr
@haifengl Is Lanczos in Smile thread-safe (also on the native level)?
Murat Koptur
@mrtkp9993
Hello, I am a newbie to Smile. I need to perform PCA and clustering, but the examples on http://haifengl.github.io/clustering.html didn't work for me.
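In case it helps, a minimal PCA-then-cluster sketch against the 2.x API (smile.projection.PCA plus KMeans); the toy data, the two retained components, and k = 3 are arbitrary choices, and the PCA class may live in a different package in other Smile versions.

```java
import smile.clustering.KMeans;
import smile.projection.PCA;

public class PcaClusterExample {
    public static void main(String[] args) {
        // Toy data: 300 points in 5 dimensions, forming three loose groups.
        double[][] data = new double[300][5];
        for (int i = 0; i < 300; i++) {
            double base = (i % 3) * 5;
            for (int j = 0; j < 5; j++) {
                data[i][j] = base + Math.random();
            }
        }

        // Project onto the first two principal components...
        PCA pca = PCA.fit(data);
        pca.setProjection(2);
        double[][] projected = pca.project(data);

        // ...then cluster the projected points.
        KMeans kmeans = KMeans.fit(projected, 3);
        System.out.println("cluster sizes: " + java.util.Arrays.toString(kmeans.size));
    }
}
```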