Python module to perform under-sampling and over-sampling with various techniques.
This comes from the n_neighbors parameter of the KNeighborsClassifier(n_neighbors=2, metric="euclidean"). By default, scikit-learn sets it to 5, whereas the original CNN uses a 1-NN rule; that explains why you get really different results.
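For reference, a minimal sketch (assuming this refers to imblearn's CondensedNearestNeighbour, whose n_neighbors accepts either an int or a KNeighborsClassifier instance) of making the rule explicit instead of relying on the 5-neighbour default:

```python
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

# 1-NN rule, as in the original CNN algorithm.
cnn_1nn = CondensedNearestNeighbour(
    n_neighbors=KNeighborsClassifier(n_neighbors=1), random_state=0
)

# 5-NN rule, which is what you get if a KNeighborsClassifier() is used
# without setting n_neighbors (scikit-learn's default is 5).
cnn_5nn = CondensedNearestNeighbour(
    n_neighbors=KNeighborsClassifier(n_neighbors=5), random_state=0
)

# X_res, y_res = cnn_1nn.fit_resample(X, y)  # X, y: your imbalanced dataset
```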
This is on master, but I will need to release a new version.
Hi all! I have a question to understand exactly how SMOTEENN works. I am using the following: resampling_model = SMOTEENN(smote=SMOTE(), enn=EditedNearestNeighbours(sampling_strategy='not majority')). What I expect from this is that SMOTE resamples all classes except the majority one (since I am not setting a sampling strategy, it uses 'auto', which is 'not majority' - right?), and that ENN then cleans the boundary cases of all classes except the majority one. Here are the results of the resampling (for some folds in a loop):
Original set of 745 objects:
id=0 -> n=200 , 26.846%
id=5 -> n=34 , 4.564%
id=4 -> n=396 , 53.154%
id=6 -> n=80 , 10.738%
id=2 -> n=18 , 2.416%
id=3 -> n=5 , 0.671%
id=1 -> n=12 , 1.611%
Counter({4: 396, 0: 200, 6: 80, 5: 34, 2: 18, 1: 12, 3: 5})
Final set of 2642 objects:
id=0 -> n=396 , 14.989%
id=1 -> n=393 , 14.875%
id=2 -> n=386 , 14.610%
id=3 -> n=396 , 14.989%
id=4 -> n=363 , 13.740%
id=5 -> n=359 , 13.588%
id=6 -> n=349 , 13.210%
Resampling from 745 to 2642 objects
Counter({0: 396, 3: 396, 1: 393, 2: 386, 4: 363, 5: 359, 6: 349})
Original set of 745 objects:
id=4 -> n=397 , 53.289%
id=5 -> n=33 , 4.430%
id=0 -> n=200 , 26.846%
id=2 -> n=19 , 2.550%
id=6 -> n=79 , 10.604%
id=1 -> n=13 , 1.745%
id=3 -> n=4 , 0.537%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 4})
Final set of 2679 objects:
id=0 -> n=397 , 14.819%
id=1 -> n=394 , 14.707%
id=2 -> n=392 , 14.632%
id=3 -> n=397 , 14.819%
id=4 -> n=366 , 13.662%
id=5 -> n=364 , 13.587%
id=6 -> n=369 , 13.774%
Resampling from 745 to 2679 objects
Counter({0: 397, 3: 397, 1: 394, 2: 392, 6: 369, 4: 366, 5: 364})
Original set of 746 objects:
id=4 -> n=397 , 53.217%
id=5 -> n=33 , 4.424%
id=0 -> n=200 , 26.810%
id=2 -> n=19 , 2.547%
id=6 -> n=79 , 10.590%
id=1 -> n=13 , 1.743%
id=3 -> n=5 , 0.670%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 5})
Final set of 2643 objects:
id=0 -> n=397 , 15.021%
id=1 -> n=396 , 14.983%
id=2 -> n=391 , 14.794%
id=3 -> n=394 , 14.907%
id=4 -> n=358 , 13.545%
id=5 -> n=359 , 13.583%
id=6 -> n=348 , 13.167%
Resampling from 746 to 2643 objects
Counter({0: 397, 1: 396, 3: 394, 2: 391, 5: 359, 4: 358, 6: 348})
What I noticed is that although the majority class is id4, the final result always keeps the id0 samples intact. Is this because the classes are equalized by SMOTE, so that when the data is passed to ENN it treats the first class as the majority one? Shouldn't it "keep track" of the original majority class? I would expect id4 to remain intact (or am I misinterpreting something here?).
Thanks, and sorry for the long post.
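One way to check this hypothesis (a sketch, not from the original post; the make_classification data is only a stand-in for the real set) is to run the two resamplers that SMOTEENN chains, SMOTE followed by ENN, one after the other and print the class counts after each step:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Synthetic imbalanced multiclass data standing in for the real set.
X, y = make_classification(
    n_samples=745,
    n_classes=3,
    n_informative=4,
    weights=[0.6, 0.3, 0.1],
    random_state=0,
)
print("original:   ", Counter(y))

# Step 1 of SMOTEENN: oversample with SMOTE ('not majority' by default),
# which brings every minority class up to the size of the majority class.
X_sm, y_sm = SMOTE().fit_resample(X, y)
print("after SMOTE:", Counter(y_sm))

# Step 2 of SMOTEENN: clean the oversampled data with ENN
# (here with the 'not majority' strategy used in the question).
enn = EditedNearestNeighbours(sampling_strategy="not majority")
X_res, y_res = enn.fit_resample(X_sm, y_sm)
print("after ENN:  ", Counter(y_res))
```

Since SMOTE with the default 'not majority' strategy raises every minority class to the size of the majority, all classes are tied by the time ENN runs, so whichever class ENN then treats as "the majority" is not necessarily the original id4.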
"all"
Thank you for the answer. I ran the following test: I used the default sampling strategies (so 'not majority' for SMOTE and 'all' for ENN) and I removed the first class, just to check the behaviour of the rest (now the majority class is id3). The results are the following (indicative runs):
Original set of 545 objects:
id=3 -> n=396 , 72.661%
id=5 -> n=80 , 14.679%
id=0 -> n=13 , 2.385%
id=2 -> n=5 , 0.917%
id=1 -> n=18 , 3.303%
id=4 -> n=33 , 6.055%
Counter({3: 396, 5: 80, 4: 33, 1: 18, 0: 13, 2: 5})
Final set of 2294 objects:
id=0 -> n=396 , 17.262%
id=1 -> n=389 , 16.957%
id=2 -> n=396 , 17.262%
id=3 -> n=365 , 15.911%
id=4 -> n=373 , 16.260%
id=5 -> n=375 , 16.347%
Resampling from 545 to 2294 objects
Counter({0: 396, 2: 396, 1: 389, 5: 375, 4: 373, 3: 365})
Original set of 545 objects:
id=3 -> n=397 , 72.844%
id=5 -> n=79 , 14.495%
id=1 -> n=18 , 3.303%
id=0 -> n=13 , 2.385%
id=4 -> n=34 , 6.239%
id=2 -> n=4 , 0.734%
Counter({3: 397, 5: 79, 4: 34, 1: 18, 0: 13, 2: 4})
Final set of 2325 objects:
id=0 -> n=397 , 17.075%
id=1 -> n=393 , 16.903%
id=2 -> n=397 , 17.075%
id=3 -> n=377 , 16.215%
id=4 -> n=386 , 16.602%
id=5 -> n=375 , 16.129%
Resampling from 545 to 2325 objects
Counter({0: 397, 2: 397, 1: 393, 4: 386, 3: 377, 5: 375})
Original set of 546 objects:
id=3 -> n=397 , 72.711%
id=5 -> n=79 , 14.469%
id=1 -> n=18 , 3.297%
id=0 -> n=13 , 2.381%
id=4 -> n=34 , 6.227%
id=2 -> n=5 , 0.916%
Counter({3: 397, 5: 79, 4: 34, 1: 18, 0: 13, 2: 5})
Final set of 2306 objects:
id=0 -> n=397 , 17.216%
id=1 -> n=394 , 17.086%
id=2 -> n=395 , 17.129%
id=3 -> n=369 , 16.002%
id=4 -> n=376 , 16.305%
id=5 -> n=375 , 16.262%
Resampling from 546 to 2306 objects
Counter({0: 397, 2: 395, 1: 394, 4: 376, 5: 375, 3: 369})
Again, the first class always stays at the same count as the majority one. Since this is a multiclass problem, I understand that during the ENN process each class is compared against all the others (which are treated as the majority, so samples are removed from them). In any case, I find it hard to believe that not a single sample is removed from id0 in any of the iterations. That makes me wonder...
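For what it's worth, here is a small synthetic sketch (made-up blobs, not the data from this thread) of running EditedNearestNeighbours(sampling_strategy='all') on its own. ENN only drops a sample when its nearest neighbours disagree with its label, so a class that sits in its own well-separated region keeps every sample:

```python
from collections import Counter

from sklearn.datasets import make_blobs
from imblearn.under_sampling import EditedNearestNeighbours

# Four synthetic classes: two compact, isolated blobs (0 and 3) and two
# blobs that overlap each other (1 and 2).
X, y = make_blobs(
    n_samples=[200, 400, 80, 80],
    centers=[(0, 10), (0, 0), (1, 1), (10, 10)],
    cluster_std=[0.5, 1.0, 1.0, 0.5],
    random_state=0,
)
print("before ENN:", Counter(y))

enn = EditedNearestNeighbours(sampling_strategy="all")
X_res, y_res = enn.fit_resample(X, y)
print("after ENN: ", Counter(y_res))
# Expect the isolated classes 0 and 3 to keep (almost) all their samples,
# because their nearest neighbours always agree with their label, while
# the overlapping classes 1 and 2 lose samples.
```

If id0 happens to be well separated in feature space (and SMOTE only adds interpolated points inside that region), that could be one reason why ENN never removes anything from it, rather than ENN tracking any particular class as "the majority".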