Guillaume Lemaitre
@glemaitre
Another solution would be to use approximate nearest neighbors instead. Something like annoy: https://github.com/spotify/annoy
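A minimal sketch of what that could look like with annoy (placeholder data, arbitrary parameter values):

    import numpy as np
    from annoy import AnnoyIndex

    X = np.random.rand(1000, 20)                   # 1000 points, 20 features (placeholder data)
    index = AnnoyIndex(X.shape[1], "euclidean")
    for i, vec in enumerate(X):
        index.add_item(i, vec)
    index.build(10)                                # number of trees: more trees = better recall, slower build
    neighbors = index.get_nns_by_vector(X[0], 5)   # 5 approximate nearest neighbors of X[0]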
razou
@razou
Thanks @glemaitre for your helpful answer
razou
@razou
Is there an implementation of the Hit Rate (or Hit Ratio) metric (generally used in recommendation engines) in scikit-learn?
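For illustration, a hit rate at k can be computed by hand; the helper below is hypothetical, not a scikit-learn API:

    import numpy as np

    def hit_rate_at_k(y_true, recommended, k=10):
        """Fraction of users whose relevant item appears in their top-k recommendations.

        y_true: array of shape (n_users,), the relevant item per user.
        recommended: array of shape (n_users, n_items), items ranked best-first.
        """
        hits = [y_true[i] in recommended[i][:k] for i in range(len(y_true))]
        return np.mean(hits)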
lesshaste
@lesshaste
I have one-hot encoded feature vectors which I am using for multiclass classification. If there is a feature in my training set which is always 0, what happens at test time when the model comes across a sample that has a 1 for that feature?
rthy
@rthy:matrix.org
[m]
@lesshaste: By default OHE would error. You would need to set handle_unknown='ignore' to ignore it.
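A minimal sketch of that behaviour:

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit([['red'], ['green']])
    # an unseen category becomes an all-zero row instead of raising an error
    print(enc.transform([['blue']]).toarray())   # [[0. 0.]]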
lesshaste
@lesshaste
@rthy:matrix.org thanks. Do you know if that would be the same for logistic regression for example?
lesshaste
@lesshaste
can sklearn.metrics.pairwise be made to work for Hamming or Levenshtein distance?
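Hamming works out of the box because pairwise_distances accepts the scipy.spatial.distance metric names; Levenshtein is not built in, so it would need a user-supplied callable or a precomputed distance matrix. A minimal sketch for Hamming:

    import numpy as np
    from sklearn.metrics import pairwise_distances

    X = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 1]])
    D = pairwise_distances(X, metric='hamming')   # fraction of positions that differ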
lesshaste
@lesshaste
what are good options for supervised categorical encoding when the target is also categorical? target encoding looks attractive but doesn't really make sense when the target is categorical
Rishabh Chakrabarti
@rishacha
Hi everyone, I am a developer who's trying to test multiple models in different frameworks. I just want to know if this idea makes sense: I want to create a single sklearn pipeline script for testing various models (all mapped to the sklearn-keras interface or using skorch). But from what I understand, models are not just plain classifiers or regressors; the models are a combination of components, and the sklearn pipeline apparently doesn't support that. Is my understanding correct? Is this a futile effort? Can someone please help me out with this issue? Thank you.
If there's an alternative, please let me know. I'm talking about object detection models. I really like the pipeline method/interface and would like to extend my models to match the same .fit, .predict interface
lesshaste
@lesshaste
HistGradientBoostingClassifier seems to have no n_jobs argument. Is there any way to set the number of threads/cores?
Nicolas Hug
@NicolasHug
lesshaste
@lesshaste
@NicolasHug thank you. Is anyone working on adding n_jobs for this classifier?
it would make it in line with the other classifiers
and can it be done in the script itself?
Nicolas Hug
@NicolasHug
it's been discussed but we ended up staying with the status quo scikit-learn/scikit-learn#14265
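For the "in the script itself" part: the histogram-based estimators use OpenMP threads rather than joblib, so the usual way to cap them is the OMP_NUM_THREADS environment variable or threadpoolctl. A minimal sketch, assuming threadpoolctl is installed:

    from threadpoolctl import threadpool_limits
    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier
    # (scikit-learn < 1.0 additionally needs: from sklearn.experimental import enable_hist_gradient_boosting)

    X, y = make_classification(n_samples=500, random_state=0)
    clf = HistGradientBoostingClassifier()
    with threadpool_limits(limits=4):   # cap OpenMP (and BLAS) threads for this block
        clf.fit(X, y)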
lesshaste
@lesshaste
@NicolasHug that is surprising. I normally agree with all the decisions of the scikit-learn devs
I have two questions about HistGradientBoostingClassifier. a) When using early stopping do you end up with the "best model" according to the validation loss or the most recent one after it stops?
b) Is the validation set used by HistGradientBoostingClassifier chosen at random, and is it the same set for every iteration of the training?
maybe these should be asked on github as an issue?
Nicolas Hug
@NicolasHug
a) there's no notion of best model. early stopping stops the training process if the score hasn't improved by more than tol in the last n_iter_no_change iterations. The score can be the loss or an arbitrary scorer and it can be computed on the training set or on the validation set
b) yes and yes
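For reference, the knobs involved look like this (the values shown are illustrative, mostly the documented defaults):

    from sklearn.ensemble import HistGradientBoostingClassifier
    # (scikit-learn < 1.0 additionally needs: from sklearn.experimental import enable_hist_gradient_boosting)

    clf = HistGradientBoostingClassifier(
        early_stopping=True,
        scoring='loss',           # the loss itself, or any scorer name/callable
        validation_fraction=0.1,  # held-out split; None -> score on the training set
        n_iter_no_change=10,      # stop if no improvement greater than tol over this many iterations
        tol=1e-7,
    )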
lesshaste
@lesshaste
Thank you. Maybe best model could be a good addition?
Nicolas Hug
@NicolasHug
I'm not sure what you mean by best model. There's no notion of best model, only one model is built. If you mean "model with the lowest training loss" that's basically the model at the last iteration, under the assumption that the training loss is always supposed to decrease (unless your learning rate becomes too high). If you mean "model with the lowest training loss that doesn't make the validation loss go up", that's what early stopping is supposed to give you (and it's preferable to the former)
lesshaste
@lesshaste
@NicolasHug let's say the latter example you gave. The problem is that with early stopping you wait some number of iterations before deciding to stop, so the final iteration is not the best. That's why catboost, for example, has a use_best_model parameter.
It is common in early stopping for the final validation loss to be higher than the loss a few epochs before. How long you wait to see if the loss will start going down again is sometimes called "patience". I think that's what pytorch lightning calls it
Romaji Milton Amulo
@RomajiQuadhash
does using the fit function on a fitted model replace the fitted model, or update it?
I'm trying to use a Lasso in a machine learning context, and I want to keep updating it with each test run I do
Romaji Milton Amulo
@RomajiQuadhash
obviously, I could in theory take the model, train it with the results of the particular test run, then merge the coefficients with the last model myself, but it would be better if I could avoid that
Guillaume Lemaitre
@glemaitre
fit does a full training from scratch
partial_fit does an update
Romaji Milton Amulo
@RomajiQuadhash
@glemaitre thank you. what kinds of models is partial_fit available for?
hrm... it seems like all of the ones with that method are only for classification, not for regression output.
Guillaume Lemaitre
@glemaitre
SGD estimator is one of them
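For a regression target, a minimal sketch with SGDRegressor (placeholder batches; penalty='l1' gives a Lasso-style penalty fitted with SGD):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    reg = SGDRegressor(penalty='l1')       # L1-penalised linear regression, fitted with SGD
    rng = np.random.RandomState(0)
    for _ in range(5):                     # one partial_fit call per new batch of results
        X_batch = rng.rand(20, 3)
        y_batch = X_batch @ np.array([1.0, -2.0, 0.5])
        reg.partial_fit(X_batch, y_batch)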
Romaji Milton Amulo
@RomajiQuadhash
oh perfect
I will use that in the project my team is doing. Thanks for helping
Alex Ioannides
@AlexIoannides
I wrote a short blog post that might be of interest to the community - deploying Scikit-Learn models to Kubernetes using Bodywork (an open source deployment tool that I have developed).
Uroš Nedić
@urosn
I would like to ask how to join two preprocessors I saved in two separate files. I have one file with a model and another with a preprocessor (computing averages and filling NaN cells), then another file with a model (same estimator) and a fourth file with the preprocessor for the second model. I would like to merge these four files into two (one joint model and one joint preprocessor).
3 replies
Uroš (Урош)
@urosn:matrix.org
[m]
I have transformer1.file, model1.file, transformer2.file and model2.file (same estimator in model1 and model2). I would like to have transformer_composite.file and model_composite.file.
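One possible reading, if the underlying goal is to end up with a single object per model that both transforms and predicts: bundle each saved preprocessor with its model in a Pipeline. A minimal sketch, assuming the files above were written with joblib:

    import joblib
    from sklearn.pipeline import make_pipeline

    transformer1 = joblib.load('transformer1.file')
    model1 = joblib.load('model1.file')

    pipeline1 = make_pipeline(transformer1, model1)   # transform then predict as one object
    joblib.dump(pipeline1, 'pipeline1_composite.file')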
Barricks
@Barrick-San
Yo
Zhengze Zhou
@ZhengzeZhou
Hi everyone, I'm a graduate student at Cornell and I had a paper (https://dl.acm.org/doi/abs/10.1145/3429445) published a while ago on correcting the bias of feature importance in tree-based methods. Impurity-based feature importances can be misleading for high-cardinality features (many unique values), which is already noted in the docstring of feature_importances_ in RandomForest. I just opened a new pull request #20058 to implement a new feature importance measurement based on out-of-bag samples, which is guaranteed to remove this bias. I think this feature is going to be useful for scikit-learn users. Any comments or suggestions will be helpful!
Kirill
@PetrovKP
Hi! Which week is scheduled for the scikit-learn 1.0 release?
Nicolas Hug
@NicolasHug
There is no specific week scheduled @PetrovKP, but we try to release every 6 months and the previous one was released in December
Guillaume Lemaitre
@glemaitre
So ideally June, but I think that for 1.0 we want a couple of features to be in the release, so we might be delayed.
Zoe Prieto
@zoeprieto_twitter
Hi, my name is Zoe Prieto. I am currently working on neutron and photon transport problems. I have some questions and maybe one of you can help me.
Roman Yurchak
@rthy:matrix.org
[m]
Sure, don't hesitate to write them here.
Zoe Prieto
@zoeprieto_twitter
Thanks! I have a list of particles with their characteristics (position, direction, energy and statistical weight). These variables are correlated. I want to fit those curves and later sample new particles. I want to know how scikit-learn keeps the correlation. I'm sorry if it is a beginner question. And thanks again.
Roman Yurchak
@rthy:matrix.org
[m]
Well, you need to define what your feature variables and your target variable are. So for instance you could try to predict the position from all the other variables. Correlations would be taken into account depending on the model; for instance, if your model is linear, the target would be a linear combination of the features. If you do have a known analytical relation between your variables, it might be easier and more reliable to use scipy.optimize or scipy.odr to find the coefficients you would like to learn, though.
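A minimal sketch of that idea, predicting one variable from the others with a linear model (the data here is a placeholder for the particle table):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    data = rng.rand(100, 4)             # placeholder: direction, energy, weight, position
    X, y = data[:, :3], data[:, 3]      # predict position from the other variables

    model = LinearRegression().fit(X, y)
    print(model.coef_)                  # the learned linear combination of the features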
Zoe Prieto
@zoeprieto_twitter
Thanks for your answer. I forgot to mention that I fit my data with KDE and I sample new particles from this model. Does this model keep the correlation between the different variables? Or does it assume the variables are independent of each other? Thanks again!
2 replies
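For reference, a minimal sketch of fitting a kernel density estimate on the full particle table and sampling from it (placeholder data, arbitrary bandwidth); KernelDensity models the joint density of all columns rather than each column independently:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.RandomState(0)
    particles = rng.rand(1000, 4)       # placeholder: position, direction, energy, weight

    kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(particles)
    new_particles = kde.sample(500, random_state=0)   # draws from the fitted joint density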
Isaack Mungui
@isaack-mungui
Hi everyone, I'm trying to create a streamlit app but I get the following message when I run streamlit run <file.py>: Make this Notebook Trusted to load map: File -> Trust Notebook. I Googled this issue, and even after making Chrome my default browser, nothing changes. Please help.
nyanpasu
@nyanpasu:matrix.org
[m]

Hello! In sklearn.decomposition.PCA, how do I tell it which column represents the label?
For example, I have a dataframe with the following columns:
feature_0 feature_1 feature_2 label
How do I tell PCA that label is the dependent variable?
Nicolas Hug
@NicolasHug
@nyanpasu:matrix.org you don't, PCA is unsupervised and doesn't take the labels as input, only the features.
1 reply
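In practice that means dropping the label column before fitting. A minimal sketch, assuming a pandas DataFrame with the columns above (values are placeholders):

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({'feature_0': [0.1, 0.2, 0.3, 0.4],
                       'feature_1': [1.0, 0.9, 1.1, 1.2],
                       'feature_2': [3.0, 2.9, 3.2, 3.1],
                       'label':     [0, 1, 0, 1]})

    X = df.drop(columns=['label'])            # PCA only ever sees the features
    components = PCA(n_components=2).fit_transform(X)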