Nicolas Hug
@NicolasHug
I'm not sure what you mean by best model. There's no notion of best model, only one model is built. If you mean "model with the lowest training loss" that's basically the model at the last iteration, under the assumption that the training loss is always supposed to decrease (unless your learning rate becomes too high). If you mean "model with the lowest training loss that doesn't make the validation loss go up", that's what early stopping is supposed to give you (and it's preferable to the former)
lesshaste
@lesshaste
@NicolasHug let's say the latter example you gave. The problem is that with early stopping you wait some number of iterations before deciding to stop, so the final iteration is not the best. That's why CatBoost, for example, has a use_best_model parameter.
It is common with early stopping for the final validation loss to be higher than the loss a few epochs before. How long you wait to see if the loss will start going down again is sometimes called "patience". I think that's what PyTorch Lightning calls it.
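For reference, scikit-learn's histogram gradient boosting exposes a patience-style parameter, n_iter_no_change; a minimal sketch with illustrative values (not tuned recommendations):

```python
# Sketch: early stopping with a "patience"-like parameter in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=2000, random_state=0)

clf = HistGradientBoostingClassifier(
    max_iter=1000,            # upper bound on boosting iterations
    early_stopping=True,      # hold out part of the training data as a validation set
    validation_fraction=0.1,  # fraction used for the validation loss
    n_iter_no_change=10,      # "patience": stop after 10 iterations without improvement
    random_state=0,
)
clf.fit(X, y)
print(clf.n_iter_)  # number of boosting iterations actually run before stopping
```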
Romaji Milton Amulo
@RomajiQuadhash
does using the fit function on a fitted model replace the fitted model, or update it?
I'm trying to use a Lasso in a machine learning context, and I want to keep updating it with each test run I do
Romaji Milton Amulo
@RomajiQuadhash
obviously, I could in theory take the model, train it with the results of the particular test run, and then merge the coefficients with the last model myself, but it would be better if I could avoid that
Guillaume Lemaitre
@glemaitre
fit does a full training from scratch
partial_fit does an incremental update
Romaji Milton Amulo
@RomajiQuadhash
@glemaitre thank you. what kinds of models is partial_fit available for?
hmm... it seems like all of the estimators with that method are for classification only, not for regression.
Guillaume Lemaitre
@glemaitre
SGD estimator is one of them
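For illustration, a minimal sketch of incremental updates with partial_fit. Lasso itself has no partial_fit, so SGDRegressor with an L1 penalty is used here as the incremental stand-in; the data and parameter values are made up:

```python
# Sketch: incremental learning with partial_fit (illustrative values).
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
model = SGDRegressor(penalty="l1", alpha=0.01)  # L1 penalty, roughly Lasso-like

# First batch: initial fit.
X1, y1 = rng.randn(100, 5), rng.randn(100)
model.partial_fit(X1, y1)

# Later batch (e.g. the results of a new test run): updates the same coefficients.
X2, y2 = rng.randn(50, 5), rng.randn(50)
model.partial_fit(X2, y2)

print(model.coef_)
```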
Romaji Milton Amulo
@RomajiQuadhash
oh perfect
I will use that in the project my team is doing. Thanks for helping
Alex Ioannides
@AlexIoannides
I wrote a short blog post that might be of interest to the community - deploying Scikit-Learn models to Kubernetes using Bodywork (an open source deployment tool that I have developed).
Uroš Nedić
@urosn
I would like to ask how to join two preprocessors I saved in two separate files. I have one file with a model and another with a preprocessor (computing averages and filling NaN cells), then another file with a model (same estimator) and a fourth file with the preprocessor for the second model. I would like to merge these four files into two (one joint preprocessor and one joint model).
3 replies
Uroš (Урош)
@urosn:matrix.org
[m]
I have transformer1.file, model1.file, transformer2.file and model2.file (same estimator in model1 and model2). I would like to have transformer_composite.file and model_composite.file.
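One common way to cut down the number of files, sketched below under the assumption that the files were saved with joblib and that bundling each transformer with its model is acceptable (this does not literally merge the two fitted models into one estimator, which is generally not possible after fitting):

```python
# Sketch: bundle each saved transformer/model pair into a single pipeline file.
# File names mirror the question above and are assumptions.
import joblib
from sklearn.pipeline import make_pipeline

transformer1 = joblib.load("transformer1.file")
model1 = joblib.load("model1.file")

# A Pipeline applies the transformer and then the estimator as one object.
pipeline1 = make_pipeline(transformer1, model1)
joblib.dump(pipeline1, "pipeline1.file")

# Repeating this for transformer2/model2 leaves two composite files instead of four.
```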
Barricks
@Barrick-San
Yo
Zhengze Zhou
@ZhengzeZhou
Hi everyone, I'm a graduate student at Cornell and I had a paper (https://dl.acm.org/doi/abs/10.1145/3429445) published a while ago on correcting the bias of feature importance in tree-based methods. Impurity-based feature importances can be misleading for high-cardinality features (many unique values), which is already noted in the docstring of feature_importances_ in RandomForest. I just opened a new pull request #20058 to implement a new feature importance measurement based on out-of-bag samples, which is guaranteed to remove this bias. I think this feature is going to be useful for scikit-learn users. Any comments or suggestions will be helpful!
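For readers running into this bias today, the workaround already available in scikit-learn is permutation importance computed on held-out data; a minimal sketch with illustrative data and parameters:

```python
# Sketch: permutation importance on held-out data as an alternative to
# impurity-based importances (illustrative data and parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print(forest.feature_importances_)  # impurity-based; can overrate high-cardinality features

result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)      # importance measured on held-out data
```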
Kirill
@PetrovKP
Hi! Which week is the scikit-learn 1.0 release scheduled for?
Nicolas Hug
@NicolasHug
There is no specific week scheduled @PetrovKP, but we try to release every 6 months and the previous one was released in December
Guillaume Lemaitre
@glemaitre
So ideally June, but I think that for 1.0 we want a couple of features to be inside the release, so we might be delayed.
Zoe Prieto
@zoeprieto_twitter
Hi, my name is Zoe Prieto. I am currently working on neutron and photon transport problems. I have some questions and maybe one of you can help me.
Roman Yurchak
@rthy:matrix.org
[m]
Sure, don't hesitate to write them here.
Zoe Prieto
@zoeprieto_twitter
Thanks! I have a list of particles with their characteristics (position, direction, energy and statistical weight). These variables are correlated. I want to fit those curves and later sample new particles. I want to know how scikit-learn keeps the correlation. I'm sorry if it is a beginner question. And thanks again.
Roman Yurchak
@rthy:matrix.org
[m]
Well, you need to define what your feature variables and your target variable are. For instance, you could try to predict the position from all the other variables. Correlations would be taken into account depending on the model; for instance, if your model is linear, the target would be a linear combination of the features. If you have a known analytical relation between your variables, though, it might be easier and more reliable to use scipy.optimize or scipy.odr to find the coefficients you would like to learn.
Zoe Prieto
@zoeprieto_twitter
Thanks for your answer. I forgot to mention that I fit my data with KDE and I sample new particles from this model. Does this model keep the correlation between the different variables, or does it assume the variables are independent of each other? Thanks again!
2 replies
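For illustration, a minimal sketch of fitting sklearn.neighbors.KernelDensity on all particle variables jointly and sampling from it; the data and bandwidth below are placeholders. Fitting on the stacked columns models the joint density rather than each variable separately:

```python
# Sketch: fit a KDE on the joint (position, direction, energy, weight) matrix
# and sample new particles from it. Data and bandwidth are illustrative.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
particles = rng.randn(1000, 4)  # stand-in for (position, direction, energy, weight)

kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(particles)  # fitted on all columns together, i.e. on the joint density

new_particles = kde.sample(n_samples=100, random_state=0)
print(new_particles.shape)  # (100, 4)
```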
Isaack Mungui
@isaack-mungui
Hi everyone, I'm trying to create a Streamlit app but I get the following message when I run streamlit run <file.py>: "Make this Notebook Trusted to load map: File -> Trust Notebook". I Googled this issue, and even after making Chrome my default browser, nothing changes. Please help.
nyanpasu
@nyanpasu:matrix.org
[m]

Hello! In sklearn.decomposition.PCA, how do I tell it which column represents the label?
For example, I have a dataframe with the following columns:
feature_0 feature_1 feature_2 label
How do I tell PCA that label is the dependent variable?

Nicolas Hug
@NicolasHug
@nyanpasu:matrix.org you don't: PCA is unsupervised and doesn't take the labels as input, only the features.
1 reply
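For illustration, a minimal sketch that drops the label column before fitting PCA; the column names follow the example above:

```python
# Sketch: PCA is fit on the feature columns only; the label column is left out.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "feature_0": [1.0, 2.0, 3.0, 4.0],
    "feature_1": [0.5, 0.1, 0.9, 0.3],
    "feature_2": [2.0, 1.5, 3.5, 2.2],
    "label":     [0, 1, 0, 1],
})

X = df.drop(columns="label")               # features only
components = PCA(n_components=2).fit_transform(X)
print(components.shape)                    # (4, 2)
```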
Felipe Fronchetti
@fronchetti
Hi folks, I am a master's student in CS and I have a question for you. I am working on a multi-class text classification problem, and I am using scikit-learn to implement my solution. I want to predict, for a paragraph x, which one of seven categories of information x belongs to. I already implemented my solution using your library, but I am not confident whether the steps I am following are correct, or if I am missing something. Could you please take a look at the image below and give your opinion? If this is not the right place for this kind of question, please let me know. Thank you in advance for your contribution! [image attached]
4 replies
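Since the attached image is not visible here, the sketch below only shows a common baseline for this kind of multi-class text classification (TF-IDF features plus a linear classifier); the example paragraphs and category names are made up:

```python
# Sketch: a baseline pipeline for multi-class text classification
# (illustrative data; not a review of the setup in the attached image).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

paragraphs = [
    "how do I install the package",
    "pip install fails on my machine",
    "I would like to contribute a bug fix",
    "guidelines for opening a pull request",
    "what license applies to redistribution",
    "can I use this code commercially",
]
categories = ["usage", "usage", "contribution", "contribution", "legal", "legal"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(paragraphs, categories)
print(clf.predict(["how can my company reuse this library"]))
```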
stimils2
@stimils2
Hi, I want to start working on scikit-learn bug fixes. If anyone is already working on some, can I team up with you?
Stanimir Ivanov
@Stanimir-Ivanov
Hi all! We're working on a generic implementation of a discrete-time survival model for random forests, similar to this and this. Basically, the idea is to split on hazard curves, which are a bit like the class probabilities of regular classification random forests but stratified per duration since inception of an observation. We want to use scikit-learn as a base. Is anyone here familiar with the random forest code? Also, tips for a good PR are very welcome.
um_duaa
@um_duaa:matrix.org
[m]
hi
الحمدلله
@um_duaa123_twitter
I have only one question, please!!!
lesshaste
@lesshaste
What would people recommend for clustering strings (e.g. english words) of the same length?
lesshaste
@lesshaste
or is this better off at github discuss?
Nicolas Hug
@NicolasHug

It really depends on the kind of data that you have. If you have a corpus of documents, LDA would be one way to get clusters/topics: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
You could also try pre-trained embeddings like word2vec and the like.
Why do they have to be of the same length?
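Following the LDA suggestion, a minimal sketch on a toy corpus (the documents, number of topics and parameters are illustrative):

```python
# Sketch: topic-style clustering of a small corpus with LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock prices fell sharply today",
    "the market rallied after the news",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)  # one row of topic proportions per document
print(topic_weights.argmax(axis=1))        # hard assignment to the dominant topic
```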

Ariel Silvio Norberto RAMOS
@asnramos
Hello... Greetings to all! I will participate in the sprint on Saturday, June 26!
Nicolas Hug
@NicolasHug
Welcome @asnramos !
José Chacón
@jchaconm
Hello all. I'm also participating in the sprint next Saturday, and I'm excited to be able to help check and fix an issue!
Temiloluwa Awoyele
@temmyzeus
Hello, how can I join the sprint?
Aditya Acharya
@acharya_aditya_mi_gitlab
@um_duaa123_twitter sure ask
Temiloluwa Awoyele
@temmyzeus
Thanks
Harsh Kumar
@HarshVardhanKumar
why isn't the website working?
Adrin Jalali
@adrinjalali
works for me
Harsh Kumar
@HarshVardhanKumar
Now it also works for me. I had tried with two different networks yesterday... it didn't work at that time.
Anyway, I wanted to ask what version of LAPACK (libblas.so) sklearn uses (assuming it uses it; if not, which BLAS library is used)?
3 replies
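For anyone with the same question, a quick way to inspect the linear algebra libraries actually in use (scikit-learn delegates most linear algebra to NumPy/SciPy):

```python
# Sketch: inspect which BLAS/LAPACK NumPy was built against, and the versions
# of scikit-learn's dependencies.
import numpy as np
np.show_config()

import sklearn
sklearn.show_versions()
```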
lesshaste
@lesshaste
If I have a neural network classifier, I can easily simulate data from the probability distribution implied by the classifier. Can this be done with any of the classifiers in scikit-learn?
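Under the interpretation that "simulate data" means sampling labels from the predicted class probabilities p(y|x), a minimal sketch that works for any scikit-learn classifier exposing predict_proba (the classifier and data here are placeholders):

```python
# Sketch: sample labels from the class probabilities implied by a fitted classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.RandomState(0)
proba = clf.predict_proba(X)  # one probability vector per sample
sampled_labels = np.array([rng.choice(clf.classes_, p=p) for p in proba])
print(sampled_labels[:10])
```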
Harsh Kumar
@HarshVardhanKumar
scikit-learn custom compilation: is it possible to pass custom gcc flags during the from-source build described here? https://scikit-learn.org/stable/developers/advanced_installation.html