lesshaste
@lesshaste
@rthy:matrix.org thanks. Do you know if that would be the same for logistic regression for example?
lesshaste
@lesshaste
can sklearn.metrics.pairwise be made to work for Hamming or Levenshtein distance?
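A hedged sketch of what is and isn't available here: pairwise_distances accepts metric="hamming" on array data out of the box (delegated to scipy), while Levenshtein operates on raw strings, so one common route is to precompute the distance matrix yourself (the levenshtein helper below is written for illustration) and feed it to estimators that accept metric="precomputed":

    import numpy as np
    from sklearn.metrics import pairwise_distances

    X = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1]])
    D_hamming = pairwise_distances(X, metric="hamming")  # fraction of differing coordinates

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    words = ["kitten", "sitting", "mitten"]
    D_lev = np.array([[levenshtein(a, b) for b in words] for a in words])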
lesshaste
@lesshaste
what are good options for supervised categorical encoding when the target is also categorical? target encoding looks attractive but doesn't really make sense when the target is categorical
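No answer appears in the log, but a safe baseline sketch (column names below are made up for illustration) is plain one-hot encoding of the categorical features; the categorical target itself needs no encoding, since scikit-learn classifiers accept string labels directly:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({
        "color": ["red", "blue", "red", "green"],   # categorical feature
        "size": ["S", "M", "L", "M"],               # categorical feature
        "target": ["cat", "dog", "cat", "bird"],    # categorical target
    })
    pre = ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
    ])
    clf = make_pipeline(pre, LogisticRegression(max_iter=1000))
    clf.fit(df[["color", "size"]], df["target"])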
Rishabh Chakrabarti
@rishacha
Hi everyone, I am a developer trying to test multiple models from different frameworks. I just want to know if this idea makes sense: I want to create a single sklearn pipeline script for testing various models (all mapped to the sklearn-keras interface or using skorch). But from what I understand, these models are not just plain classifiers or regressors; they are a combination, and the sklearn pipeline apparently doesn't support that. Is my understanding correct? Is this a futile effort? Can someone please help me out with this? Thank you.
If there's an alternative, please let me know. I'm talking about object detection models. I really like the pipeline method/interface and would like to extend my models to match the same .fit, .predict interface
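For context, a sketch of why this is usually not futile: Pipeline only requires the fit/predict duck type, so an arbitrary framework model can be wrapped in a thin estimator. MyDetector below is a hypothetical stand-in for such a model, and structured outputs like bounding boxes do stretch the per-sample-prediction assumption pipelines make:

    from sklearn.base import BaseEstimator
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    class DetectorWrapper(BaseEstimator):
        def __init__(self, epochs=10):
            self.epochs = epochs  # hyper-parameters live in __init__

        def fit(self, X, y):
            self.model_ = MyDetector(epochs=self.epochs)  # hypothetical framework model
            self.model_.train(X, y)
            return self

        def predict(self, X):
            return self.model_.infer(X)  # hypothetical inference call

    pipe = make_pipeline(StandardScaler(), DetectorWrapper(epochs=5))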
lesshaste
@lesshaste
HistGradientBoostingClassifier seems to have no n_jobs argument. Is there any way to set the number of threads/cores?
Nicolas Hug
@NicolasHug
lesshaste
@lesshaste
@NicolasHug thank you. Is anyone working on adding n_jobs for this classifier?
it would make it in line with the other classifiers
and can it be done in the script itself?
Nicolas Hug
@NicolasHug
it's been discussed, but we ended up staying with the status quo: scikit-learn/scikit-learn#14265
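To the "can it be done in the script itself?" question: the HistGradientBoosting* estimators parallelise with OpenMP rather than joblib, so the thread count is typically capped with the OMP_NUM_THREADS environment variable or, in-script, with threadpoolctl. A minimal sketch, assuming a recent scikit-learn where the estimator is no longer behind the experimental import:

    from threadpoolctl import threadpool_limits
    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    with threadpool_limits(limits=4, user_api="openmp"):  # cap OpenMP threads
        clf = HistGradientBoostingClassifier().fit(X, y)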
lesshaste
@lesshaste
@NicolasHug that is surprising. I normally agree with all the decisions of the scikit-learn devs
I have two questions about HistGradientBoostingClassifier. a) When using early stopping do you end up with the "best model" according to the validation loss or the most recent one after it stops?
b) Is the validation set used by HistGradientBoostingClassifier chosen at random, and is it the same set for every iteration of the training?
maybe these should be asked on github as an issue?
Nicolas Hug
@NicolasHug
a) there's no notion of best model. early stopping stops the training process if the score hasn't improved by more than tol in the last n_iter_no_change iterations. The score can be the loss or an arbitrary scorer and it can be computed on the training set or on the validation set
b) yes and yes
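A minimal sketch of the behaviour described in (a) and (b), using the public parameters:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    clf = HistGradientBoostingClassifier(
        early_stopping=True,
        scoring="loss",           # or any scorer; computed on the validation split
        validation_fraction=0.1,  # held out once at random, fixed across iterations
        n_iter_no_change=10,      # stop if no improvement over this many iterations
        tol=1e-7,                 # minimum improvement that counts
        random_state=0,
    ).fit(X, y)
    print(clf.n_iter_)  # number of boosting iterations actually run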
lesshaste
@lesshaste
Thank you. Maybe best model could be a good addition?
Nicolas Hug
@NicolasHug
I'm not sure what you mean by best model. There's no notion of best model, only one model is built. If you mean "model with the lowest training loss" that's basically the model at the last iteration, under the assumption that the training loss is always supposed to decrease (unless your learning rate becomes too high). If you mean "model with the lowest training loss that doesn't make the validation loss go up", that's what early stopping is supposed to give you (and it's preferable to the former)
lesshaste
@lesshaste
@NicolasHug let's say the latter example you gave. The problem is that with early stopping you wait some number of iterations before deciding to stop, so the final iteration is not the best. That's why CatBoost, for example, has a use_best_model parameter.
It is common in early stopping for the final validation loss to be higher than the loss a few epochs before. How long you wait to see if the loss will start going down again is sometimes called "patience". I think that's what PyTorch Lightning calls it
Romaji Milton Amulo
@RomajiQuadhash
does using the fit function on a fitted model replace the fitted model, or update it?
I'm trying to use a Lasso in a machine learning context, and I want to keep updating it with each test run I do
Romaji Milton Amulo
@RomajiQuadhash
obviously, I could in theory take the model, train it with the results of the particular test run, then merge the coefficients with the last model myself, but it would be better if I could avoid that
Guillaume Lemaitre
@glemaitre
fit does a full training from scratch
partial_fit does an update
Romaji Milton Amulo
@RomajiQuadhash
@glemaitre thank you. what kinds of models is partial_fit available for?
hrm... it seems like all of the estimators with that method are for classification, not for regression output.
Guillaume Lemaitre
@glemaitre
SGD estimator is one of them
Romaji Milton Amulo
@RomajiQuadhash
oh perfect
I will use that in the project my team is doing. Thanks for helping
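A sketch of the distinction for the Lasso-style use case above: SGDRegressor with an L1 penalty exposes partial_fit, so each test run's data can update the existing coefficients instead of retraining from scratch:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor(penalty="l1", alpha=1e-3, random_state=0)
    rng = np.random.default_rng(0)
    for _ in range(5):  # one call per incoming test run
        X_batch = rng.normal(size=(32, 4))
        y_batch = X_batch @ np.array([1.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=32)
        model.partial_fit(X_batch, y_batch)  # updates the coefficients, doesn't restart
    print(model.coef_)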
Alex Ioannides
@AlexIoannides
I wrote a short blog post that might be of interest to the community - deploying Scikit-Learn models to Kubernetes using Bodywork (an open-source deployment tool that I have developed).
Uroš Nedić
@urosn
I would like to ask how to join two preprocessors I saved in two separate files. I have one file with a model and another with its preprocessor (computing averages and filling NaN cells), then another file with a second model (same estimator) and a fourth file with the preprocessor for the second model. I would like to merge these four files into two (one joint model and one joint preprocessor).
Uroš (Урош)
@urosn:matrix.org
[m]
I have transformer1.file, model1.file, transformer2.file and model2.file (same estimator in model1 and model2). I would like to have transformer_composite.file and model_composite.file.
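One way to do the merge, sketched with the file names from the message and assuming the objects were persisted with joblib: chain each fitted (transformer, model) pair into a Pipeline. Note this gives one self-contained file per model rather than a composite transformer plus a composite model, which is usually the more natural split in scikit-learn:

    import joblib
    from sklearn.pipeline import make_pipeline

    for i in (1, 2):
        transformer = joblib.load(f"transformer{i}.file")
        model = joblib.load(f"model{i}.file")
        pipe = make_pipeline(transformer, model)  # preprocessing + estimator in one object
        joblib.dump(pipe, f"pipeline{i}.file")    # hypothetical output names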
Barricks
@Barrick-San
Yo
Zhengze Zhou
@ZhengzeZhou
Hi everyone, I'm a graduate student at Cornell and I had a paper (https://dl.acm.org/doi/abs/10.1145/3429445) published a while ago on correcting the bias of feature importance in tree-based methods. Impurity-based feature importances can be misleading for high-cardinality features (many unique values), which is already noted in the docstring of feature_importances_ in RandomForest. I just opened a new pull request #20058 to implement a new feature importance measurement based on out-of-bag samples, which is guaranteed to remove this bias. I think this feature is going to be useful for scikit-learn users. Any comments or suggestions will be helpful!
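For readers unfamiliar with the bias being corrected, a small sketch: a high-cardinality noise column tends to receive inflated impurity-based importance, while permutation importance on held-out data (already shipped in scikit-learn) is much less fooled:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
    rng = np.random.default_rng(0)
    X = np.column_stack([X, rng.integers(0, 1000, size=len(X))])  # high-cardinality noise
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(rf.feature_importances_[-1])  # impurity importance, often inflated for the noise column
    print(permutation_importance(rf, X_te, y_te, random_state=0).importances_mean[-1])  # near zero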
Kirill
@PetrovKP
Hi! Which week is the scikit-learn 1.0 release scheduled for?
Nicolas Hug
@NicolasHug
There is no specific week scheduled @PetrovKP, but we try to release every 6 months and the previous one was released in December
Guillaume Lemaitre
@glemaitre
So ideally June, but I think that for 1.0 we want a couple of features to be in the release, so we might be delayed.
Zoe Prieto
@zoeprieto_twitter
Hi, my name is Zoe Prieto. I am currently working on neutron and photon transport problems. I have some questions and maybe one of you can help me.
Roman Yurchak
@rthy:matrix.org
[m]
Sure, don't hesitate to write them here.
Zoe Prieto
@zoeprieto_twitter
Thanks! I have a list of particles with their characteristics (position, direction, energy and statistical weight). These variables are correlated. I want to fit those curves and later sample new particles. I want to know how scikit-learn keeps the correlation. I'm sorry if it is a beginner question. And thanks again.
Roman Yurchak
@rthy:matrix.org
[m]
Well, you need to define what your feature variables and the target variable are. For instance, you could try to predict the position from all the other variables. Correlations would be taken into account depending on the model: for instance, if your model is linear, the target would be a linear combination of the features. If you have a known analytical relation between your variables, though, it might be easier and more reliable to use scipy.optimize or scipy.odr to find the coefficients you would like to learn.
Zoe Prieto
@zoeprieto_twitter
Thanks for your answer. I forgot to mention that I fit my data with KDE and I sample new particles from this model. Does this model keep the correlation between the different variables, or does it assume the variables are independent from each other? Thanks again!
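A sketch one can verify directly: KernelDensity is fit on the joint distribution, so samples drawn from it preserve the correlations between columns rather than treating them as independent. The toy data below stands in for the particle variables:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    # two correlated variables, e.g. energy and statistical weight
    X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
    kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)
    new_particles = kde.sample(1000, random_state=0)
    print(np.corrcoef(new_particles.T)[0, 1])  # close to the original 0.8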
Isaack Mungui
@isaack-mungui
Hi everyone, I'm trying to create a Streamlit app but get the following message when I run streamlit run <file.py>: "Make this Notebook Trusted to load map: File -> Trust Notebook". I Googled this issue, and even after making Chrome my default browser, nothing changes. Please help.
nyanpasu
@nyanpasu:matrix.org
[m]
Hello! In sklearn.decomposition.PCA, how do I tell it which column represents the label?
For example, I have a dataframe with the following columns: feature_0, feature_1, feature_2, label.
How do I tell PCA that label is the dependent variable?
Nicolas Hug
@NicolasHug
@nyanpasu:matrix.org you don't, PCA is unsupervised and doesn't take the labels as input, only the features.
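In practice that means dropping the label column before fitting; a small sketch using the columns from the question:

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({
        "feature_0": [1.0, 2.0, 3.0, 4.0],
        "feature_1": [2.0, 1.0, 4.0, 3.0],
        "feature_2": [0.5, 0.7, 0.2, 0.9],
        "label": ["a", "b", "a", "b"],
    })
    X = df.drop(columns="label")          # PCA sees only the features
    components = PCA(n_components=2).fit_transform(X)
    y = df["label"]                       # kept aside for a downstream supervised step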
Felipe Fronchetti
@fronchetti
Hi folks, I am a master's student in CS and I have a question for you. I am working on a multi-class text classification problem, and I am using scikit-learn to implement my solution. I want to predict, for a paragraph x, which one of seven categories of information x belongs to. I have already implemented my solution using your library, but I am not confident that the steps I am following are correct, or whether I am missing something. Could you please take a look at the image below and give your opinion? If this is not the right place for this kind of question, please let me know. Thank you in advance for your contribution! [image]
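The attached image is lost, so the poster's actual steps can't be checked here; for reference, a generic baseline sketch for multi-class paragraph classification (toy data made up for illustration) is TF-IDF features plus a linear classifier, evaluated with cross-validation:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    paragraphs = ["how to install the tool", "error when running tests",
                  "license terms apply", "install guide for windows",
                  "tests fail on CI", "GPL license question"]
    labels = ["install", "testing", "license", "install", "testing", "license"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, paragraphs, labels, cv=2)  # stratified by default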
stimils2
@stimils2
Hi, I want to start working on scikit-learn bug fixes. If anyone is already working on them, can I team up with you?
Stanimir Ivanov
@Stanimir-Ivanov
Hi all! We're working on a generic implementation of a discrete-time survival model for random forests, similar to this and this. Basically, the idea is to split on hazard curves, which are a bit like the class probabilities of regular classification random forests but stratified per duration since inception of an observation. We want to use scikit-learn as a base. Is anyone here familiar with the random forest code? Tips for a good PR are also very welcome.
um_duaa
@um_duaa:matrix.org
[m]
hi