Guillaume Lemaitre
@glemaitre
and it usually beats any fancy resampling
and it is just much faster :)
any algorithm based on KNN does not scale properly
But this is only my 2 cents on the issue. I am happy to see applications where this is indeed not the case :)
Billel Mokeddem
@mokeddembillel
Oh, it sounds cool. Actually, I was going to do the same, because after applying RandomUnderSampler I was going to use XGBoost; I believe it's somewhat similar to what BalancedBaggingClassifier does. But anyway, I will try your solution. Thanks for helping.
Guillaume Lemaitre
@glemaitre
XGBoost is just the same as HistGradientBoostingClassifier from scikit-learn
but this is slower
In classification, the loss will still be affected by the balancing issue
that's why making an ensemble of GBDTs where each one sees a balanced bootstrap sample could be better
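A minimal sketch of that idea (illustrative, not from the chat): BalancedBaggingClassifier fitting each member on a balanced bootstrap sample, with HistGradientBoostingClassifier as the base learner. Assumes scikit-learn >= 1.0 (older versions need the enable_hist_gradient_boosting experimental import) and an imbalanced-learn release of that era, where the parameter is named base_estimator (newer releases renamed it estimator).

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Toy imbalanced data standing in for a real problem
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

# Each bagging member sees a balanced bootstrap sample (random undersampling
# under the hood), so each GBDT computes its loss on balanced data
clf = BalancedBaggingClassifier(
    base_estimator=HistGradientBoostingClassifier(),  # "estimator" in newer releases
    n_estimators=10,
    random_state=0,
).fit(X, y)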
Billel Mokeddem
@mokeddembillel
I see, thank you for the information
Soledad Galli
@solegalli
Hello, what is the stopping criterion for AllKNN? In the comments in the source code I see that (2) the algorithm stops when one class is disappearing, which makes sense, but I don't understand (1) "the number of samples in the other class becomes inferior to the majority class"? Would the majority class not have more examples already, by definition? https://github.com/scikit-learn-contrib/imbalanced-learn/blob/e7ccf10/imblearn/under_sampling/prototype_selection/edited_nearest_neighbours.py#L583
Soledad Galli
@solegalli
Another question, which one is the official documentation website? https://imbalanced-learn.org or https://imbalanced-learn.readthedocs.io ?
Guillaume Lemaitre
@glemaitre
the first link is the right one
however the second link redirects to the first one
Priyam Mehta
@prikmm
Hello, everyone, I have a usage question, which I have posted on SO, here's the link: https://stackoverflow.com/questions/65652054/not-able-to-feed-the-combined-smote-randomundersampler-pipeline-into-the-main. Can someone help me with this? Thank you.
Priyam Mehta
@prikmm
TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. 'Pipeline(steps=[('smote', SMOTE(n_jobs=-1, random_state=42)), ('under', RandomUnderSampler(random_state=42))])' implements both)

Also, can someone explain what this error means? The Pipeline only exposes fit and fit_resample methods; since transform is not implemented, the first condition is not met, and the second one about fit_resample is met. Then shouldn't this work? Thank you.
Guillaume Lemaitre
@glemaitre
Can you post the entire traceback to check which transformer/resampler is raising the condition?
oh I see
your smote_pipeline implements fit_resample and transform as well
Basically you cannot use an imbalanced-learn pipeline within another pipeline (we did not think about that case) because you have an ambiguity:
the pipeline does not know if it should call fit_resample or fit/transform
In your case, you should be able to solve this issue using a flat pipeline:
Main_Pipeline = imb_Pipeline([
     ('feature_handler', FeatureTransformer(list(pearson_feature_vector.index))),
     ('smote', SMOTE()),
     ('random_under_sampler', RandomUnderSampler()),
     ('scaler', StandardScaler()),
     ('pca', PCA(n_components=0.99)),
     ('model', LogisticRegression(max_iter=1750)),
])
It should be equivalent.
Priyam Mehta
@prikmm

Please correct me if my understanding is lacking.
So, when I call fit on the Main_Pipeline, since smote_pipeline has a fit method, it is assumed that transform is also present. Actually it doesn't have one; I tried to call transform and got an error:

AttributeError: 'RandomUnderSampler' object has no attribute 'transform'

Pipeline code:

Smote_Under_pipeline = imb_Pipeline([
    ('smote', SMOTE(random_state=rnd_state, n_jobs=-1)),
    ('under', RandomUnderSampler(random_state=rnd_state)),
])

So, because of that assumption, fit/transform and fit_resample both appear to be available. This causes ambiguity and the code blows up?

Guillaume Lemaitre
@glemaitre
Yes, the sampler does not implement transform but the pipeline does
and tries to call the transform of the underlying estimator
you need to do hasattr(smote_pipeline, "transform")
and you will see that this is true
yes, there is an ambiguity because we don't know if you would like to call transform or fit_resample
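A tiny sketch of the check being described, using the same objects as in the question (per the discussion, the attribute lookup succeeds even though the final step is a sampler):

from imblearn.pipeline import Pipeline as imb_Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

smote_pipeline = imb_Pipeline([
    ('smote', SMOTE()),
    ('under', RandomUnderSampler()),
])

# The Pipeline class itself defines transform, so the attribute exists
# (True here, per the discussion) even though actually calling it would
# fail on the RandomUnderSampler final step
print(hasattr(smote_pipeline, "transform"))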
Priyam Mehta
@prikmm
Cool, thanks!!
Soledad Galli
@solegalli
Would it be possible to extend the functionality of the BalancedBaggingClassifier and BalancedRandomForests to other sampling techniques (e.g., SMOTE) by allowing the user to pass the over- or under-sampling method as a parameter instead of hard-coding RandomUnderSampler?
Guillaume Lemaitre
@glemaitre
If I recall properly, they leverage sample_weight, and therefore you would need a Sampler that stores indices to build the sample_weight vector
The second consideration is computational performance
Random US/OS are not costly
adding a sampler based on k-NN will not scale
and in practice, I tend to think that RUS and ROS would be enough to alleviate the issue with an ensemble learner.
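As it happens, newer imbalanced-learn releases did add a sampler parameter to BalancedBaggingClassifier, which is essentially what is being asked for here. A hedged sketch, assuming a version that supports it:

from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

# Assumes a release where BalancedBaggingClassifier accepts a sampler;
# each member then sees a SMOTE-resampled bootstrap instead of a random
# undersample (note the k-NN scaling caveat raised above)
clf = BalancedBaggingClassifier(sampler=SMOTE(), n_estimators=10,
                                random_state=0).fit(X, y)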
Soledad Galli
@solegalli
makes sense, thank you!
Hanchung Lee
@leehanchung

Hi,

I am getting an error loading a trained imblearn.pipeline Pipeline saved by joblib. This is the error message:

ModuleNotFoundError: No module named 'imblearn.over_sampling._smote.base'; 'imblearn.over_sampling._smote' is not a package

The trained pipeline was saved via joblib.dump(pipeline, 'filename.joblib'). Any tips as to where the saving and loading process went wrong?

Guillaume Lemaitre
@glemaitre
make sure that the version installed is the same as the version used to pickle
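A minimal sketch of that check (the filename comes from the message above; nothing else is assumed about the artifact):

import imblearn
import joblib

# A pickled pipeline is only guaranteed to load under the same library
# versions that produced it; compare this against the training environment
print("imbalanced-learn:", imblearn.__version__)

pipeline = joblib.load('filename.joblib')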
mcihat
@mcihat

Hello everyone.
I have a usage question about EasyEnsembleClassifier. I have a dataset which has 450,000 rows with 13 columns (12 features, 1 target). My dataset is imbalanced (1:50), so I decided to use EasyEnsembleClassifier. I realized that all the subsets are exactly the same for all the estimators.
I found this issue which is similar to my problem: scikit-learn-contrib/imbalanced-learn#116
In theory, the classifier should create a subset for each estimator. These subsets should contain all the minority class samples plus the same number of samples drawn from the majority class. In my case I should have roughly 18,000 samples in each subset (I have roughly 9,000 samples in the minority class). However, when I use the estimators_samples_ attribute, it seems like the output arrays for my estimators are exactly the same and all of them have the size of the complete training set (80% of my dataset). So I decided to run a test:
'''
import numpy as np
from sklearn.datasets import make_classification
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=10, random_state=1)

clf = EasyEnsembleClassifier(n_estimators=5, n_jobs=-1)

clf.fit(X, y)

arr = clf.estimators_samples_
arr

Output:
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
'''

What am I doing wrong here? Obviously I am missing something.

Guillaume Lemaitre
@glemaitre
To check the samples used by each estimator, you should use:
In [13]: for est in clf.estimators_:
    ...:     print(est[0].sample_indices_)
[4 6 7 0 9 2]
[4 6 7 5 8 3]
[4 6 7 1 2 5]
[4 6 7 3 1 5]
[4 6 7 3 5 2]
I am not sure what estimators_samples_ is reporting. It might be a bug then
It's weird that we don't document it
oh I see, we should add it to the documentation
Guillaume Lemaitre
@glemaitre
estimators_samples_ gives the samples dispatched to each pipeline, whose first step will later undersample them
This attribute exists because we inherit from the BaggingClassifier from scikit-learn
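Putting the two attributes side by side, with a self-contained variant of the toy setup above:

from sklearn.datasets import make_classification
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_classes=2, weights=[0.3, 0.7],
                           n_samples=10, random_state=1)
clf = EasyEnsembleClassifier(n_estimators=5).fit(X, y)

# Indices dispatched to each member by the bagging machinery, before the
# internal RandomUnderSampler runs (inherited from BaggingClassifier)
print(clf.estimators_samples_[0])

# Indices actually kept by the sampler inside the first fitted pipeline
print(clf.estimators_[0][0].sample_indices_)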
mcihat
@mcihat

The code you provided works fine with my generated dataset but when I use it on my real dataset this is what I get:

clf = EasyEnsembleClassifier(n_estimators=5, n_jobs=-1, sampling_strategy = 1.0)
clf.fit(X_train, y_train)
for est in clf.estimators_:
    print(est[0].sample_indices_)

Output:
[279507 240017  23859 ...  94249  87790 120830]
[277730  75855  70104 ... 341432 318980 130029]
[166614    207  72374 ...  93568  76905 142951]
[304630  28272 143132 ... 159062 264981  41332]
[ 35943 358917  68200 ... 121931 209190 284075]

Is this a normal result? I would expect the first three indices in each row to be the same. I mean, all of the samples that belong to the minority class are being used in all subsets. I am not saying this is wrong; I am just asking if this is normal.