A Python module to perform under-sampling and over-sampling with various techniques.
You should use the Pipeline from imblearn.pipeline (instead of sklearn.pipeline) such that it can handle samplers, and then pass it to the grid search:

search_cv = GridSearchCV(model, param_grid=param_grid)
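Putting the pieces together, here is a minimal runnable sketch; the SMOTE + LogisticRegression steps and the toy dataset are illustrative assumptions, not taken from the thread:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, which accepts samplers as steps

# Toy imbalanced dataset (illustrative only)
X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# The sampler runs only on the training folds during cross-validation
model = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression()),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search_cv = GridSearchCV(model, param_grid=param_grid)
search_cv.fit(X, y)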
from collections import Counter

from sklearn.datasets import fetch_mldata
from sklearn.neighbors import KNeighborsClassifier

from imblearn.under_sampling import CondensedNearestNeighbour

# Pima Indians diabetes dataset
pima = fetch_mldata('diabetes_scale')
X, y = pima['data'], pima['target']
print('Original dataset shape %s' % Counter(y))

# Pass a full estimator as n_neighbors so the metric can be customised
cnn = CondensedNearestNeighbour(random_state=42,
                                n_neighbors=KNeighborsClassifier(metric="euclidean"))
The difference comes from the n_neighbors of the KNeighborsClassifier(n_neighbors=2, metric="euclidean"). By default, scikit-learn sets it to 5, while the original CNN uses a 1-NN rule. That explains why you get really different results.
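A minimal sketch of the corresponding fix, assuming a recent imbalanced-learn where samplers expose fit_resample, and reusing X and y from the snippet above:

# Match the original CNN rule by using a 1-NN sub-estimator
knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
cnn = CondensedNearestNeighbour(random_state=42, n_neighbors=knn)
X_res, y_res = cnn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))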
It is fixed in master, but I will need to make a release.
Hi to all! I have a question in order to understand exactly how SMOTEENN works. I am using the following:

resampling_model = SMOTEENN(smote=SMOTE(), enn=EditedNearestNeighbours(sampling_strategy='not majority'))

What I expect from this is to resample all classes except the majority one (since I am not passing anything for sampling_strategy, SMOTE uses the 'auto' default, which is 'not majority', right?), and then ENN will clean the boundary cases of all classes except the majority one. Here are the resampling results (for some folds in a loop):
Original set of 745 objects:
id=0 -> n=200 , 26.846%
id=5 -> n=34 , 4.564%
id=4 -> n=396 , 53.154%
id=6 -> n=80 , 10.738%
id=2 -> n=18 , 2.416%
id=3 -> n=5 , 0.671%
id=1 -> n=12 , 1.611%
Counter({4: 396, 0: 200, 6: 80, 5: 34, 2: 18, 1: 12, 3: 5})
Final set of 2642 objects:
id=0 -> n=396 , 14.989%
id=1 -> n=393 , 14.875%
id=2 -> n=386 , 14.610%
id=3 -> n=396 , 14.989%
id=4 -> n=363 , 13.740%
id=5 -> n=359 , 13.588%
id=6 -> n=349 , 13.210%
Resampling from 745 to 2642 objects
Counter({0: 396, 3: 396, 1: 393, 2: 386, 4: 363, 5: 359, 6: 349})
Original set of 745 objects:
id=4 -> n=397 , 53.289%
id=5 -> n=33 , 4.430%
id=0 -> n=200 , 26.846%
id=2 -> n=19 , 2.550%
id=6 -> n=79 , 10.604%
id=1 -> n=13 , 1.745%
id=3 -> n=4 , 0.537%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 4})
Final set of 2679 objects:
id=0 -> n=397 , 14.819%
id=1 -> n=394 , 14.707%
id=2 -> n=392 , 14.632%
id=3 -> n=397 , 14.819%
id=4 -> n=366 , 13.662%
id=5 -> n=364 , 13.587%
id=6 -> n=369 , 13.774%
Resampling from 745 to 2679 objects
Counter({0: 397, 3: 397, 1: 394, 2: 392, 6: 369, 4: 366, 5: 364})
Original set of 746 objects:
id=4 -> n=397 , 53.217%
id=5 -> n=33 , 4.424%
id=0 -> n=200 , 26.810%
id=2 -> n=19 , 2.547%
id=6 -> n=79 , 10.590%
id=1 -> n=13 , 1.743%
id=3 -> n=5 , 0.670%
Counter({4: 397, 0: 200, 6: 79, 5: 33, 2: 19, 1: 13, 3: 5})
Final set of 2643 objects:
id=0 -> n=397 , 15.021%
id=1 -> n=396 , 14.983%
id=2 -> n=391 , 14.794%
id=3 -> n=394 , 14.907%
id=4 -> n=358 , 13.545%
id=5 -> n=359 , 13.583%
id=6 -> n=348 , 13.167%
Resampling from 746 to 2643 objects
Counter({0: 397, 1: 396, 3: 394, 2: 391, 5: 359, 4: 358, 6: 348})
What I noticed is that, although the majority class is id4, the final result always seems to keep the id0 samples intact. Is this because the samples are equalized by SMOTE, so when they are passed to ENN it treats the first class as the majority one? Shouldn't it "keep track" of the original majority class? I would expect id4 to remain intact (or am I misinterpreting something here?).
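For reference, here is a minimal sketch of how I understand the chaining (SMOTEENN applies SMOTE first, then ENN; the toy dataset below is only a stand-in for my data):

from collections import Counter

from sklearn.datasets import make_classification

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Toy multi-class imbalanced data (illustrative only)
X, y = make_classification(n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=0)
print('Before:', Counter(y))

# Step 1: SMOTE with the default 'auto' (= 'not majority') strategy equalizes
# every class to the majority count, so classes can end up tied for "majority".
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print('After SMOTE:', Counter(y_sm))

# Step 2: ENN resolves 'not majority' from the counts it receives, i.e. the
# already-equalized ones, not from the original class counts.
enn = EditedNearestNeighbours(sampling_strategy='not majority')
X_res, y_res = enn.fit_resample(X_sm, y_sm)
print('After ENN:', Counter(y_res))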
Thanks and sorry for the long post