Guillaume Lemaitre
@glemaitre
So now you can use the same scheme with the sampler:
you only need to use the Pipeline from imblearn.pipeline, which can handle samplers,
and declare a list of all potential samplers to try and pass it to the grid.
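(For illustration, a minimal sketch of the setup described above; the candidate samplers, classifier, and toy dataset are placeholders, not from the chat:)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline  # imblearn's Pipeline knows how to handle samplers
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([("sampler", SMOTE()), ("clf", LogisticRegression(max_iter=1000))])
# grid-search over the "sampler" step itself to try several samplers
param_grid = {"sampler": [SMOTE(), RandomOverSampler(), RandomUnderSampler()]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="balanced_accuracy")
grid.fit(X, y)
print(grid.best_params_)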
Dennis
@ydennisy
Wow!
That is neat! Would be great to have that on the docs :)
Thanks @glemaitre
Guillaume Lemaitre
@glemaitre
but I don't remember if we have it documented anywhere
Dennis
@ydennisy
Thanks for your help and the MOOC link, not seen that before!
Guillaume Lemaitre
@glemaitre
Regarding the MOOC, we will have an open session at the beginning of next year ;)
Soledad Galli
@solegalli
I made a (Bayesian) search over different over- and under-samplers, cost-sensitive learning, and specific ensemble methods with Optuna: https://www.kaggle.com/solegalli/nested-hyperparameter-spaces-with-optuna. Feedback is welcome.
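(Not the notebook's actual code, just a minimal sketch of the nested-search idea with Optuna, assuming SMOTE and RandomUnderSampler as the candidate samplers and a random forest as the model:)

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

def objective(trial):
    # the sampler-specific parameters only exist inside their own branch,
    # which is what makes the hyperparameter space "nested"
    if trial.suggest_categorical("sampler", ["smote", "under"]) == "smote":
        sampler = SMOTE(k_neighbors=trial.suggest_int("k_neighbors", 3, 10))
    else:
        sampler = RandomUnderSampler()
    model = Pipeline([("sampler", sampler), ("clf", RandomForestClassifier(random_state=0))])
    return cross_val_score(model, X, y, cv=3, scoring="balanced_accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)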
statcom
@statcom
I want to use a custom distance metric for the nearest-neighbour (NN) undersampling methods. For example, KNeighborsClassifier in sklearn has a 'metric' argument to specify your own distance metric between instances. But I couldn't find any way to do that with, for example, CondensedNearestNeighbour or fit_resample.
Guillaume Lemaitre
@glemaitre
n_neighbors accepts an arbitrary scikit-learn KNeighborsClassifier. So you can create a scikit-learn object with the desired metric and plug it into the n_neighbors of CondensedNearestNeighbour
statcom
@statcom
I am confused. n_neighbors contains the K value for the classifier. What do you mean by "plug it into n_neighbors"? Well, on second thought, I may understand what you meant, but the suggestion sounds like a hack. In that case, I will just change your script to accommodate the metric. Thanks for your answer.
Guillaume Lemaitre
@glemaitre
n_neighbors : int or estimator object, default=None so you can pass an estimator as:
from collections import Counter
from sklearn.datasets import fetch_openml  # fetch_mldata was removed from scikit-learn; fetch_openml is the closest replacement
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

# Pima Indians Diabetes dataset from OpenML (stands in for the old fetch_mldata call)
pima = fetch_openml('diabetes', version=1, as_frame=False)
X, y = pima['data'], pima['target']
print('Original dataset shape %s' % Counter(y))
cnn = CondensedNearestNeighbour(random_state=42, n_neighbors=KNeighborsClassifier(metric="euclidean"))
statcom
@statcom
Thanks a lot. With your answer, I was able to run the example. But I am having an issue with sampling_strategy for the NN methods. If I leave it as the default, sampling changes {0: 500, 1: 268} to {0: 211, 1: 268}, but if I change the option to "all", I get {0: 211, 1: 1}. But I want to maintain a sampling rate of 0.5, so that my target sampling would be {0: 250, 1: 134}. Is there any way to do that with NN methods?
Guillaume Lemaitre
@glemaitre
you cannot have the exact ratio that you want
You might control it better using the n_neighbors of the KNeighborsClassifier(n_neighbors=2, metric="euclidean"). By default, scikit-learn sets it to 5. The original CNN uses a 1-NN rule, which explains why you get such different results
statcom
@statcom
'all' sampling_strategy still produced {0: 145, 1: 1} with your suggestions. Your answer corresponds to what I figured from your code. My ultimate goal with these methods is not balancing data over classes, but selecting instances for a certain accuracy. So I don't need the exact ratio, just similar numbers.
Guillaume Lemaitre
@glemaitre
In this case, you might want to grid-search the parameters to find the best imbalance ratio
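(As a rough sketch of what such a grid search could look like; the pipeline, classifier, and candidate values below are illustrative, not from the chat:)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=0)

pipe = Pipeline([
    ("cnn", CondensedNearestNeighbour(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# vary the inner KNN used by CNN; each value changes how aggressively it undersamples
param_grid = {"cnn__n_neighbors": [KNeighborsClassifier(n_neighbors=k) for k in (1, 2, 5)]}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="balanced_accuracy")
grid.fit(X, y)
print(grid.best_params_)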
statcom
@statcom
Will try. Thanks!
Konrad
@KonuTech
Hello. First of all, sorry if my question is not specific enough. I wonder why the example from https://imbalanced-learn.org/stable/references/generated/imblearn.keras.BalancedBatchGenerator.html is not working with TensorFlow 2.7.0 and imbalanced-learn 0.8.1. I understand that I have to install TensorFlow to get Keras. Thanks in advance for a reply.
Guillaume Lemaitre
@glemaitre
I am actually trying to fix several issues right now with the CI
and I will revise the compatibility with tensorflow and keras
You might want to downgrade tensorflow in the meantime or wait a bit for the upcoming release
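(For context, the usage pattern from that documentation page looks roughly like the sketch below; per the discussion above it may fail with TensorFlow 2.7.0 + imbalanced-learn 0.8.1, and the tiny model and synthetic data are placeholders:)

from sklearn.datasets import make_classification
from tensorflow import keras
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# each batch is rebalanced by the sampler before being fed to the network
training_generator = BalancedBatchGenerator(
    X, y, sampler=RandomUnderSampler(), batch_size=32, random_state=42)
model.fit(training_generator, epochs=5)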
Konrad
@KonuTech
Thanks
iyuiasd7
@iyuiasd7
hello, I tried oversampling with ADASYN, but an exception was thrown: "Not any neigbours belong to the majority class. This case will induce a NaN case with a division by zero. ADASYN is not suited for this specific dataset. Use SMOTE instead."
Guillaume Lemaitre
@glemaitre
Did you try SMOTE then?
iyuiasd7
@iyuiasd7
Yes, SMOTE is working properly, not sure what caused it
"In fact, ADASYN focuses on generating samples next to the original samples which are wrongly classified using a k-Nearest Neighbors classifier while the basic implementation of SMOTE will not make any distinction between easy and hard samples to be classified using the nearest neighbors rule."
So if the part of the algorithm that try to find the difficult sample fail
then ADASYN does not work
This is an algorithmic problem with wrong assumption that does not apply to your dataset
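(A minimal defensive pattern along these lines; the helper name is made up, and it assumes the error above surfaces as a RuntimeError:)

from imblearn.over_sampling import ADASYN, SMOTE

def oversample_with_fallback(X, y, random_state=0):
    # try ADASYN first; if it cannot find the "difficult" samples it needs
    # (the error discussed above), fall back to SMOTE
    try:
        return ADASYN(random_state=random_state).fit_resample(X, y)
    except RuntimeError:
        return SMOTE(random_state=random_state).fit_resample(X, y)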
iyuiasd7
@iyuiasd7
Thanks for your explanation
wzh19980708
@wzh19980708
[screenshot of the error]
My recent running example reported an error. I don't know what went wrong at that step
[screenshot of the error]
Guillaume Lemaitre
@glemaitre
scikit-learn 1.0.1 broke some of the internals
the problem is solved in master but I will need to make a release
it will happen in the coming days
wzh19980708
@wzh19980708
Thanks for your explanation. The earlier version of scikit-learn won't have this problem, will it?
Guillaume Lemaitre
@glemaitre
it would work with 1.0.0
gmaravel
@gmaravel:matrix.org

Hi to all! I have a question in order to understand how SMOTEENN works exactly. I am using the following: resampling_model = SMOTEENN(smote=SMOTE(), enn=EditedNearestNeighbours(sampling_strategy='not majority')). So what I expect from this is to resample all classes except for the majority one (since I am not setting anything for sampling_strategy it uses 'auto', which is 'not majority', right?), and that ENN will clean all boundary cases except for the majority class. Here are the results of the resampling (for some folds in a loop):

-> Repetition #1 :: split #1

Original set of 745 objects:
id=0 -> n=200 , 26.846%
id=5 -> n=34 , 4.564%
id=4 -> n=396 , 53.154%
id=6 -> n=80 , 10.738%
id=2 -> n=18 , 2.416%
id=3 -> n=5 , 0.671%
id=1 -> n=12 , 1.611%
Counter({4: 396, 0: 200, 6: 80, 5: 34, 2: 18, 1: 12, 3: 5})

Final set of 2642 objects:
id=0 -> n=396 , 14.989%
id=1 -> n=393 , 14.875%
id=2 -> n=386 , 14.610%
id=3 -> n=396 , 14.989%
id=4 -> n=363 , 13.740%
id=5 -> n=359 , 13.588%
id=6 -> n=349 , 13.210%
Resampling from 745 to 2642 objects
Counter({0: 396, 3: 396, 1: 393, 2: 386, 4: 363, 5: 359, 6: 349})

-> Repetition #1 :: split #2

Original set of 745 objects:
id=4 -> n=397 , 53.289%
id=5 -> n=33 , 4.430%
id=0 -> n=200 , 26.846%
id=2 -> n=19 , 2.550%
id=6 -> n=79 , 10.604%
id=1 -> n=13 , 1.745%
id=3 -> n=4 , 0.537%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 4})

Final set of 2679 objects:
id=0 -> n=397 , 14.819%
id=1 -> n=394 , 14.707%
id=2 -> n=392 , 14.632%
id=3 -> n=397 , 14.819%
id=4 -> n=366 , 13.662%
id=5 -> n=364 , 13.587%
id=6 -> n=369 , 13.774%
Resampling from 745 to 2679 objects
Counter({0: 397, 3: 397, 1: 394, 2: 392, 6: 369, 4: 366, 5: 364})

-> Repetition #1 :: split #3

Original set of 746 objects:
id=4 -> n=397 , 53.217%
id=5 -> n=33 , 4.424%
id=0 -> n=200 , 26.810%
id=2 -> n=19 , 2.547%
id=6 -> n=79 , 10.590%
id=1 -> n=13 , 1.743%
id=3 -> n=5 , 0.670%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 5})

Final set of 2643 objects:

id=0 -> n=397 , 15.021%
id=1 -> n=396 , 14.983%
id=2 -> n=391 , 14.794%
id=3 -> n=394 , 14.907%
id=4 -> n=358 , 13.545%
id=5 -> n=359 , 13.583%
id=6 -> n=348 , 13.167%
Resampling from 746 to 2643 objects
Counter({0: 397, 1: 396, 3: 394, 2: 391, 5: 359, 4: 358, 6: 348})

What I noticed is that although the majority class is id4, the final result always seems to keep id0 intact. Is this because the samples are equalized by SMOTE, and when the data is passed to ENN it treats the first class as the majority one? Shouldn't it "keep track" of the majority class? I would expect id4 to remain intact (or am I misinterpreting something here).

Thanks and sorry for the long post

Guillaume Lemaitre
@glemaitre
Yes, I assume that we apply SMOTE and then ENN without passing around information
So the majority for ENN is not the same as for SMOTE
Then I am not sure it would have an impact, since ENN is just there to clean data points and you can still have noisy data in the majority class
I think that this is the reason why the default ENN in SMOTEENN is set to "all"
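(To make the order of operations concrete, a small sketch; the synthetic dataset is just for illustration:)

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_classes=3, n_informative=4, weights=[0.7, 0.2, 0.1],
                           n_samples=1000, random_state=0)

# SMOTEENN chains SMOTE then ENN; ENN only sees the already-oversampled data,
# so "majority" for ENN is computed on that data, not on the original class counts
sme = SMOTEENN(smote=SMOTE(random_state=0),
               enn=EditedNearestNeighbours(sampling_strategy='all'),
               random_state=0)
_, y_res = sme.fit_resample(X, y)
print(Counter(y_res))

# roughly equivalent manual two-step version, making the order explicit
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
_, y_clean = EditedNearestNeighbours(sampling_strategy='all').fit_resample(X_sm, y_sm)
print(Counter(y_clean))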