Guillaume Lemaitre
@glemaitre
In this case, you might want to grid-search the parameters to find the best imbalance ratio
statcom
@statcom
Will try. Thanks!
Konrad
@KonuTech
Hello. First of all, sorry if my question is not specific enough. I wonder whether the example from https://imbalanced-learn.org/stable/references/generated/imblearn.keras.BalancedBatchGenerator.html works with TensorFlow 2.7.0 and imbalanced-learn 0.8.1. I understand that I have to install TensorFlow to get Keras. Thanks in advance for your reply.
Guillaume Lemaitre
@glemaitre
I am actually trying to fix several issues right now with the CI
and I will revise the compatibility with tensorflow and keras
You might want to downgrade tensorflow in the meantime or wait a bit for the upcoming release
Konrad
@KonuTech
Thanks
iyuiasd7
@iyuiasd7
hello, I tried to oversample with ADASYN, but an exception was thrown ("Not any neigbours belong to the majority class. This case will induce a NaN case with a division by zero. ADASYN is not suited for this specific dataset. Use SMOTE instead.")
Guillaume Lemaitre
@glemaitre
Did you try SMOTE then?
iyuiasd7
@iyuiasd7
Yes, SMOTE is working properly, not sure what caused it
"In fact, ADASYN focuses on generating samples next to the original samples which are wrongly classified using a k-Nearest Neighbors classifier while the basic implementation of SMOTE will not make any distinction between easy and hard samples to be classified using the nearest neighbors rule."
So if the part of the algorithm that tries to find the difficult samples fails
then ADASYN does not work
This is an algorithmic limitation: ADASYN's assumption does not apply to your dataset
iyuiasd7
@iyuiasd7
Thanks for your explanation
wzh19980708
@wzh19980708
[screenshot: c397f6ac1d98c80a028243fede40922.png]
My recent running example reported an error. I don't know what went wrong at that step.
[screenshot: image.png]
Guillaume Lemaitre
@glemaitre
scikit-learn 1.0.1 broke some of the internals
the problem is solved in master but I will need to release
it will happen in the coming days
wzh19980708
@wzh19980708
Thanks for your explanation. The earlier version of scikit-learn won't have this problem, will it?
Guillaume Lemaitre
@glemaitre
it would work with 1.0.0
gmaravel
@gmaravel:matrix.org
Hi to all! I have a question in order to understand how SMOTEENN is working exactly. I am using the following resampling_model = SMOTEENN(smote=SMOTE(), enn=EditedNearestNeighbours(sampling_strategy='not majority')). So what I expect from this is to resample all classes except for the majority one (since I am not putting anything for sampling strategy it calls the 'auto' which is 'not majority' - right?) and the ENN will clear all boundary cases except for the majority one in this case. I show you the results of resampling (for some folds in a loop):

-> Repetition #1 :: split #1

Original set of 745 objects:
id=0 -> n=200 , 26.846%
id=5 -> n=34 , 4.564%
id=4 -> n=396 , 53.154%
id=6 -> n=80 , 10.738%
id=2 -> n=18 , 2.416%
id=3 -> n=5 , 0.671%
id=1 -> n=12 , 1.611%
Counter({4: 396, 0: 200, 6: 80, 5: 34, 2: 18, 1: 12, 3: 5})

Final set of 2642 objects:
id=0 -> n=396 , 14.989%
id=1 -> n=393 , 14.875%
id=2 -> n=386 , 14.610%
id=3 -> n=396 , 14.989%
id=4 -> n=363 , 13.740%
id=5 -> n=359 , 13.588%
id=6 -> n=349 , 13.210%
Resampling from 745 to 2642 objects
Counter({0: 396, 3: 396, 1: 393, 2: 386, 4: 363, 5: 359, 6: 349})

-> Repetition #1 :: split #2

Original set of 745 objects:
id=4 -> n=397 , 53.289%
id=5 -> n=33 , 4.430%
id=0 -> n=200 , 26.846%
id=2 -> n=19 , 2.550%
id=6 -> n=79 , 10.604%
id=1 -> n=13 , 1.745%
id=3 -> n=4 , 0.537%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 4})

Final set of 2679 objects:
id=0 -> n=397 , 14.819%
id=1 -> n=394 , 14.707%
id=2 -> n=392 , 14.632%
id=3 -> n=397 , 14.819%
id=4 -> n=366 , 13.662%
id=5 -> n=364 , 13.587%
id=6 -> n=369 , 13.774%
Resampling from 745 to 2679 objects
Counter({0: 397, 3: 397, 1: 394, 2: 392, 6: 369, 4: 366, 5: 364})

-> Repetition #1 :: split #3

Original set of 746 objects:
id=4 -> n=397 , 53.217%
id=5 -> n=33 , 4.424%
id=0 -> n=200 , 26.810%
id=2 -> n=19 , 2.547%
id=6 -> n=79 , 10.590%
id=1 -> n=13 , 1.743%
id=3 -> n=5 , 0.670%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 5})

Final set of 2643 objects:

id=0 -> n=397 , 15.021%
id=1 -> n=396 , 14.983%
id=2 -> n=391 , 14.794%
id=3 -> n=394 , 14.907%
id=4 -> n=358 , 13.545%
id=5 -> n=359 , 13.583%
id=6 -> n=348 , 13.167%
Resampling from 746 to 2643 objects
Counter({0: 397, 1: 396, 3: 394, 2: 391, 5: 359, 4: 358, 6: 348})

What I noticed is that although the majority class is id4, the final result always seems to keep id0 intact. Is this because the samples are equalized by SMOTE, and when they are passed to ENN it treats the first class as the majority one? Shouldn't it "keep track" of the majority class? I would expect id4 to remain intact (or am I misinterpreting something here).

Thanks and sorry for the long post

Guillaume Lemaitre
@glemaitre
Yes, I assume that we apply SMOTE and then ENN without passing information around
So the majority class for ENN is not the same as for SMOTE
Then I am not sure it has an impact, since ENN is just there to clean data points, and you can still have noisy data in the majority class
I think this is the reason why the default ENN in SMOTEENN is set to "all"
gmaravel
@gmaravel:matrix.org
Thank you for your answer. I did the following test: I used the default sampling strategy (so 'not majority' for SMOTE and 'all' for ENN) and removed the first class, just to check the behavior with the rest (now the majority class is id3). The results are the following (indicative runs):

-> Repetition #1 :: split #1

Original set of 545 objects:
id=3 -> n=396 , 72.661%
id=5 -> n=80 , 14.679%
id=0 -> n=13 , 2.385%
id=2 -> n=5 , 0.917%
id=1 -> n=18 , 3.303%
id=4 -> n=33 , 6.055%
Counter({3: 396, 5: 80, 4: 33, 1: 18, 0: 13, 2: 5})

Final set of 2294 objects:
id=0 -> n=396 , 17.262%
id=1 -> n=389 , 16.957%
id=2 -> n=396 , 17.262%
id=3 -> n=365 , 15.911%
id=4 -> n=373 , 16.260%
id=5 -> n=375 , 16.347%
Resampling from 545 to 2294 objects
Counter({0: 396, 2: 396, 1: 389, 5: 375, 4: 373, 3: 365})

-> Repetition #1 :: split #2

Original set of 545 objects:
id=3 -> n=397 , 72.844%
id=5 -> n=79 , 14.495%
id=1 -> n=18 , 3.303%
id=0 -> n=13 , 2.385%
id=4 -> n=34 , 6.239%
id=2 -> n=4 , 0.734%
Counter({3: 397, 5: 79, 4: 34, 1: 18, 0: 13, 2: 4})

Final set of 2325 objects:
id=0 -> n=397 , 17.075%
id=1 -> n=393 , 16.903%
id=2 -> n=397 , 17.075%
id=3 -> n=377 , 16.215%
id=4 -> n=386 , 16.602%
id=5 -> n=375 , 16.129%
Resampling from 545 to 2325 objects
Counter({0: 397, 2: 397, 1: 393, 4: 386, 3: 377, 5: 375})

-> Repetition #1 :: split #3

Original set of 546 objects:
id=3 -> n=397 , 72.711%
id=5 -> n=79 , 14.469%
id=1 -> n=18 , 3.297%
id=0 -> n=13 , 2.381%
id=4 -> n=34 , 6.227%
id=2 -> n=5 , 0.916%
Counter({3: 397, 5: 79, 4: 34, 1: 18, 0: 13, 2: 5})

Final set of 2306 objects:
id=0 -> n=397 , 17.216%
id=1 -> n=394 , 17.086%
id=2 -> n=395 , 17.129%
id=3 -> n=369 , 16.002%
id=4 -> n=376 , 16.305%
id=5 -> n=375 , 16.262%
Resampling from 546 to 2306 objects
Counter({0: 397, 2: 395, 1: 394, 4: 376, 5: 375, 3: 369})

Again, the first class always remains intact, as if it were the majority one. Since this is a multiclass problem, the comparison during the ENN process is done for each class against all others (which are considered the majority class, so they are removed). In any case, I cannot believe that not even one source is removed from id0 in all iterations. That makes me wonder...

gmaravel
@gmaravel:matrix.org
Hello! Do you have any updates on this issue? I am asking because I have been using this as an improvement in a submitted version of a paper, and I am (unfortunately) a bit pressed for time. If there is no direct solution, I will need to drop all this effort and replace it with something else (perhaps simple SMOTE solutions, even though they are noisier).
Guillaume Lemaitre
@glemaitre
You are welcome to implement the fix
toth12
@toth12
Hello, I have time series data with categorical variables; it is unevenly distributed over time; I would like to balance it and simply analyse it; there is no machine learning involved, so it has no explicit target variable. I am writing to ask if I could use any of your methods. I am a bit confused: could I, for instance, use the year from the time series data as the target variable and create more samples for each year? Many thanks.
Guillaume Lemaitre
@glemaitre
be aware that the resamplers only work when your target is discrete, and thus a classification problem
toth12
@toth12
cheers
yamashi
@yamashi
Hi there, I am trying to use SMOTENC but am having issues with memory usage...
My input shape is (150000, 6700) with only 7 non-categorical inputs; when I try to fit_resample, it crashes my Colab session due to RAM usage. Is there anything I can do to solve this issue?
yamashi
@yamashi
For anyone in the future having the same issue: my workaround was to first train an autoencoder on the categories and then just use SMOTE on the latent space.