Adrin Jalali
@adrinjalali
@binarybana really nice!
Anirudh Vaish
@avaish1409

Hey there!

I am trying to develop a new Python module for video chatting with a bot. (You are welcome to collaborate.)

Repo: https://github.com/avaish1409/VideoChatBot/
Gitter: https://gitter.im/VideoChatBot/community

Downloads: 429 (in the first 2 days after launch)

pip install VideoChatBot
or
pip install https://files.pythonhosted.org/packages/5b/cc/9dbb790525fe3daa8f0822e60eec38dfea8af5e33af0334dc66b4a022ac4/VideoChatBot-0.0.2.tar.gz

Do contribute on GitHub; let's build it together!

Please star the repository if you like it 😁

lester1027
@lester1027
Hello, everyone. I am new here. I have found that the user guide for sklearn 0.24 is well structured. Is there any way for me to download a PDF version of it? I could only find PDFs of the older versions. Thank you.
Nicolas Hug
@NicolasHug
@lester1027 I think we stopped generating PDF versions and simply provide the HTML files, so the docs look just as they do on the website. Generating PDFs involved LaTeX and it was difficult to maintain (random failures every now and then, etc.)
if you really want PDFs you can try converting the docs to PDF with pandoc
razou
@razou

Hello,
I'm using sklearn.neighbors.NearestNeighbors (scikit-learn==0.24.0) to find nearest neighbors

from sklearn.neighbors import NearestNeighbors
knnModel = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='minkowski', p=2, n_jobs=-1)
knnModel.fit(train_df)  # fit on the training data before querying
test_df.apply(lambda u: knnModel.kneighbors(X=u.values.reshape(1, -1), n_neighbors=5, return_distance=False).ravel().tolist(), axis=1)

test_df contains 5K rows and 77 columns (from one-hot encoding) and the execution takes around 32 minutes
(train_df's shape is (1754249, 77))

Is it normal for the execution to take this long?
Any tips to improve the performance and reduce this execution time?
Thanks

Guillaume Lemaitre
@glemaitre
For each sample in test_df you have to compute the distances to each sample in train_df, which is why it is pretty costly.
Potential workaround: you might want to create prototypes by clustering your training data and run the KNN query against the centroids instead of the original data (see the sketch below).
Another solution would be to use an approximate nearest neighbor search instead. Something like annoy: https://github.com/spotify/annoy
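A minimal sketch of the prototype idea, assuming stand-ins for the train_df/test_df from the question (the sizes and the MiniBatchKMeans choice are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
train = rng.rand(50_000, 77)  # stand-in for train_df.values
test = rng.rand(5_000, 77)    # stand-in for test_df.values

# Compress the training set into a few thousand prototypes.
km = MiniBatchKMeans(n_clusters=1_000, random_state=0).fit(train)

# The neighbor search now runs against 1,000 centroids
# instead of every training row.
nn = NearestNeighbors(n_neighbors=5, n_jobs=-1).fit(km.cluster_centers_)

# A single batched kneighbors call is also much faster than
# calling kneighbors row by row inside an apply.
neighbors = nn.kneighbors(test, return_distance=False)
print(neighbors.shape)  # (5000, 5)
```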
razou
@razou
Thanks @glemaitre for your helpful answer
razou
@razou
Does an implementation of Hit Rate (or Hit Ratio), a metric generally used in recommendation engines, exist in scikit-learn?
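(No one answers this in the thread; to my knowledge scikit-learn has no built-in hit-rate metric. A hedged sketch of computing hit rate @ k by hand, with illustrative names:)

```python
import numpy as np

def hit_rate_at_k(relevant_items, recommended_items, k=5):
    """Fraction of users with at least one relevant item in their top-k recommendations."""
    hits = [
        len(set(rel) & set(rec[:k])) > 0
        for rel, rec in zip(relevant_items, recommended_items)
    ]
    return float(np.mean(hits))

# Two users: the first gets a hit within its top-2, the second does not.
relevant = [[10, 42], [7]]
recommended = [[42, 3, 5], [1, 2, 3]]
print(hit_rate_at_k(relevant, recommended, k=2))  # 0.5
```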
lesshaste
@lesshaste
I have one hot encoded feature vectors which I am using for multiclass classification. If in my training set there is a feature which is always 0, what happens in testing when it comes across one that is a 1 for the feature?
rthy
@rthy:matrix.org
[m]
@lesshaste: By default the OneHotEncoder would raise an error. You would need to set handle_unknown='ignore' to ignore unseen categories, as in the sketch below.
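A minimal sketch of the behaviour rthy describes (the colour column is illustrative):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['red'], ['blue']])

# 'green' was never seen during fit; with handle_unknown='ignore'
# it encodes to an all-zeros row instead of raising an error.
print(enc.transform([['green']]).toarray())  # [[0. 0.]]
```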
lesshaste
@lesshaste
@rthy:matrix.org thanks. Do you know if that would be the same for logistic regression for example?
lesshaste
@lesshaste
Can sklearn.metrics.pairwise be made to work for Hamming or Levenshtein distance?
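(Not answered in the thread, but as a hedged aside: pairwise_distances accepts the scipy 'hamming' metric directly, as below. There is no built-in Levenshtein, though a custom Python callable can be passed via metric=, at the cost of speed.)

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0, 1, 1, 0],
              [0, 1, 0, 1]])

# 'hamming' is delegated to scipy and returns the fraction of
# coordinates that differ between each pair of rows.
print(pairwise_distances(X, metric='hamming'))  # [[0. 0.5], [0.5 0.]]
```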
lesshaste
@lesshaste
what are good options for supervised categorical encoding when the target is also categorical? target encoding looks attractive but doesn't really make sense when the target is categorical
Rishabh Chakrabarti
@rishacha
Hi everyone, I am a developer trying to test multiple models across different frameworks. I just want to know if this idea makes sense: I want to create a single sklearn pipeline script for testing various models (all mapped to the sklearn-Keras interface or using skorch). But from what I understand, the models are not just plain classifiers or regressors; they are combinations, and the sklearn Pipeline apparently doesn't support that. Is my understanding correct? Is this a futile effort? Can someone please help me out with this issue? Thank you.
If there's an alternative, please let me know. I'm talking about object detection models. I really like the Pipeline method/interface and would like to extend my models to match the same .fit/.predict interface
lesshaste
@lesshaste
HistGradientBoostingClassifier seems to have no n_jobs argument. Is there any way to set the number of threads/cores?
Nicolas Hug
@NicolasHug
HistGradientBoostingClassifier is parallelized with OpenMP, so the number of threads is controlled with the OMP_NUM_THREADS environment variable rather than an n_jobs parameter
lesshaste
@lesshaste
@NicolasHug thank you. Is anyone working on adding n_jobs for this classifier?
it would make it in line with the other classifiers
and can it be done in the script itself?
Nicolas Hug
@NicolasHug
it's been discussed but we ended up staying with the status quo: scikit-learn/scikit-learn#14265
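(Not stated in the thread, but for limiting OpenMP threads from inside a script, threadpoolctl is one option; a hedged sketch, with an illustrative dataset:)

```python
from threadpoolctl import threadpool_limits
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

# Cap the thread pools (including OpenMP) for everything in this block.
with threadpool_limits(limits=2):
    clf = HistGradientBoostingClassifier().fit(X, y)
```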
lesshaste
@lesshaste
@NicolasHug that is surprising. I normally agree with all the decisions of the scikit-learn devs
I have two questions about HistGradientBoostingClassifier. a) When using early stopping do you end up with the "best model" according to the validation loss or the most recent one after it stops?
b) Is the validation set chosen by HistGradientBoostingClassifier drawn at random, and is it the same set for every iteration of the training?
maybe these should be asked on github as an issue?
Nicolas Hug
@NicolasHug
a) there's no notion of a best model. Early stopping stops the training process if the score hasn't improved by more than tol in the last n_iter_no_change iterations. The score can be the loss or an arbitrary scorer, and it can be computed on the training set or on a validation set (see the sketch below for the relevant parameters)
b) yes and yes
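A minimal sketch of the early-stopping knobs Nicolas mentions (the values shown are the defaults around 0.24, used here for illustration, not as recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=5_000, random_state=0)

clf = HistGradientBoostingClassifier(
    early_stopping=True,      # force early stopping on
    scoring='loss',           # monitor the loss itself (or any scorer)
    validation_fraction=0.1,  # held-out split, drawn once before training
    n_iter_no_change=10,      # stop after 10 iterations without improvement > tol
    tol=1e-7,
    random_state=0,
).fit(X, y)

print(clf.n_iter_)  # number of boosting iterations actually run
```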
lesshaste
@lesshaste
Thank you. Maybe a best-model option could be a good addition?
Nicolas Hug
@NicolasHug
I'm not sure what you mean by best model. There's no notion of best model, only one model is built. If you mean "model with the lowest training loss" that's basically the model at the last iteration, under the assumption that the training loss is always supposed to decrease (unless your learning rate becomes too high). If you mean "model with the lowest training loss that doesn't make the validation loss go up", that's what early stopping is supposed to give you (and it's preferable to the former)
lesshaste
@lesshaste
@NicolasHug let's say the latter example you gave. The problem is that with early stopping you wait some number of iterations before deciding to stop, so the final iteration is not the best. That's why CatBoost, for example, has a use_best_model parameter.
It is common with early stopping for the final validation loss to be higher than the loss a few epochs before. How long you wait to see whether the loss will start going down again is sometimes called "patience"; I think that's what PyTorch Lightning calls it
Romaji Milton Amulo
@RomajiQuadhash
does using the fit function on a fitted model replace the fitted model, or update it?
I'm trying to use a Lasso in a machine learning context, and I want to keep updating it with each test run I do
Romaji Milton Amulo
@RomajiQuadhash
obviously, I could in theory take the model, train it with the results of the particular test run, then merge the coefficients with the last model myself, but it would be better if I could avoid that
Guillaume Lemaitre
@glemaitre
fit does a full training from scratch
partial_fit does an incremental update
Romaji Milton Amulo
@RomajiQuadhash
@glemaitre thank you. what kinds of models is partial_fit available for?
hrm... it seems like all of the estimators with that method are only for classification, not for regression.
Guillaume Lemaitre
@glemaitre
SGD estimator is one of them
Romaji Milton Amulo
@RomajiQuadhash
oh perfect
I will use that in the project my team is doing. Thanks for helping
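A hedged sketch of the fit vs. partial_fit distinction, using SGDRegressor (one of the SGD estimators; penalty='l1' gives Lasso-style regularization, which seems close to what Romaji wants):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
model = SGDRegressor(penalty='l1', random_state=0)

for run in range(3):
    # Pretend each test run yields a fresh batch of data.
    X_batch = rng.rand(100, 5)
    y_batch = X_batch @ np.array([1.0, 0.0, 2.0, 0.0, 0.5]) + 0.1 * rng.randn(100)

    # partial_fit updates the existing coefficients in place;
    # calling fit() here would instead retrain from scratch each time.
    model.partial_fit(X_batch, y_batch)

print(model.coef_)
```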
Alex Ioannides
@AlexIoannides
I wrote a short blog post that might be of interest to the community - deploying Scikit-Learn models to Kubernetes using Bodywork (an open source deployment tool that I have developed).
Uroš Nedić
@urosn
I would like to ask how to join two preprocessors I saved in two separate files. I have one file with a model and another with its preprocessor (computing averages and filling NaN cells), then another file with a second model (same estimator) and a fourth file with the preprocessor for that second model. I would like to merge these four files into two (one joint model and one joint preprocessor).
Uroš (Урош)
@urosn:matrix.org
[m]
I have transformer1.file, model1.file, transformer2.file and model2.file (same estimator in model1 and model2). I would like to have transformer_composite.file and model_composite.file.
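(Not from the replies, but one common route, sketched here with Uroš's file names: chain each transformer and model into a Pipeline and persist that single object with joblib. Fully merging the two resulting pipelines into one composite model would additionally need an ensemble wrapper such as VotingRegressor, depending on the intent.)

```python
import joblib
from sklearn.pipeline import Pipeline

transformer1 = joblib.load('transformer1.file')
model1 = joblib.load('model1.file')

# One object that applies the preprocessing and then the model;
# two of these replace the original four files.
pipe1 = Pipeline([('preprocess', transformer1), ('model', model1)])
joblib.dump(pipe1, 'pipeline1.file')
```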
Barricks
@Barrick-San
Yo
Zhengze Zhou
@ZhengzeZhou
Hi everyone, I'm a graduate student at Cornell and I had a paper (https://dl.acm.org/doi/abs/10.1145/3429445) published a while ago on correcting the bias of feature importance in tree-based methods. Impurity-based feature importances can be misleading for high-cardinality features (many unique values), which is already noted in the docstring of feature_importances_ in RandomForest. I just opened a new pull request #20058 to implement a new feature importance measurement based on out-of-bag samples, which is guaranteed to remove this bias. I think this feature is going to be useful for scikit-learn users. Any comments or suggestions will be helpful!
Kirill
@PetrovKP
Hi! Which week is the scikit-learn 1.0 release scheduled for?
Nicolas Hug
@NicolasHug
There is no specific week scheduled @PetrovKP, but we try to release every 6 months and the previous one was released in December
Guillaume Lemaitre
@glemaitre
So ideally June, but I think that for 1.0 we want a couple of features to be in the release, so we might be delayed.