Mustafa Aldemir
@mstfldmr
My data is not categorical; it's free-form description text
Soledad Galli
@solegalli
Random question: do you have any experience with how widely used MetaCost is? https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf, I mean for working with imbalanced data? Supposedly it wraps any classifier. Do you know a good Python implementation, or are you planning to make it part of imbalanced-learn?
Guillaume Lemaitre
@glemaitre
There is a plan to maybe include it in scikit-learn
Eleni Markou
@emarkou
Hello! Maybe a usage question... I am using SMOTE and trying to oversample the minority class to a specific ratio, so I am passing a float value in (0, 1] to the sampling_strategy argument. No matter the value I set, even 1, I always get "ValueError: The specified ratio required to remove samples from the minority class while trying to generate new samples. Please increase the ratio."
The distribution of my target variable is the following, so I was expecting anything above roughly 0.1 to be fine. Am I missing something?
[image: distribution of the target variable]
Christos Aridas
@chkoar
@emarkou could you post an MRE and your versions?
Christos Aridas
@chkoar
In any case, without an MRE, I suppose that your target ratio is too low to generate new examples. In the oversampling case, the sampling_strategy is relative to the majority class. Having said that, in the case of 90:10 you will need at least 0.13 in order to generate (and add) a single minority instance, so the new ratio will be 90:11.
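A minimal sketch of that arithmetic, using a synthetic 90:10 dataset as a hypothetical stand-in for the data above:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic dataset with exactly 90 majority and 10 minority samples
X, y = make_classification(n_samples=100, weights=[0.9, 0.1], flip_y=0, random_state=0)
print(Counter(y))  # Counter({0: 90, 1: 10})

# sampling_strategy is the minority/majority ratio after resampling:
# 0.13 * 90 ≈ 11.7, so the minority class grows to 11 samples (90:11)
X_res, y_res = SMOTE(sampling_strategy=0.13, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # Counter({0: 90, 1: 11})

# Any value <= 10/90 ≈ 0.11 would require *removing* minority samples,
# which is exactly what the ValueError quoted above complains about.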
Billel Mokeddem
@mokeddembillel
Hey guys, can anyone here review my issue (scikit-learn-contrib/imbalanced-learn#781)? It's about adding a new feature to Condensed Nearest Neighbour. I want to start working on it, but first I want to hear your opinion. Thank you.
Guillaume Lemaitre
@glemaitre
The issue is that AllKNN applies ENN several times
so we can stop after a certain number of iterations based on some criterion
which can be considered a form of early stopping
CNN only does a single iteration
We have an inner iteration that goes through all samples to decide whether or not to exclude them
but stopping there would be a bad idea
You might then treat only a specific area of your data distribution, which I think would not be beneficial
at least this is my intuition
Probably, tuning the hyperparameters would be better in this case.
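For instance, a minimal sketch on synthetic data of what that tuning could look like (hypothetical dataset; n_neighbors is CNN's main knob):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Vary the number of neighbours and inspect how aggressively CNN condenses
for k in (1, 3, 5):
    X_res, y_res = CondensedNearestNeighbour(
        n_neighbors=k, random_state=0
    ).fit_resample(X, y)
    print(k, Counter(y_res))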
Billel Mokeddem
@mokeddembillel
Yes, I think you're right, it would be a bad idea. But even with tuning the hyperparameters, it doesn't work in all the situations I tried, so I ended up using RandomUnderSampler and it yielded the best result
Guillaume Lemaitre
@glemaitre
It depends on exactly what problem you want to solve
but in my experience, the recipe that works to handle balancing issues is
to train a BalancedBaggingClassifier (which uses a RandomUnderSampler internally) with a strong learner such as HistGradientBoosting
and it usually beats any fancy resampling
and it is just much faster :)
any algorithm based on KNN does not scale properly
But this is only my 2 cents on the issue. I am happy to see applications where this is indeed not the case :)
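A minimal sketch of that recipe on a synthetic imbalanced dataset (note: the base-learner argument is named estimator in recent imbalanced-learn releases and base_estimator in older ones):

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

# Each bagged estimator is fit on a balanced bootstrap drawn by a RandomUnderSampler
clf = BalancedBaggingClassifier(
    estimator=HistGradientBoostingClassifier(),
    n_estimators=10,
    random_state=0,
)
print(cross_val_score(clf, X, y, scoring="balanced_accuracy").mean())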
Billel Mokeddem
@mokeddembillel
Oh, that sounds cool. Actually, I was going to do something similar: after applying RandomUnderSampler, I was going to use XGBoost, which I believe is somewhat similar to what BalancedBaggingClassifier does. But anyway, I will try your solution. Thanks for helping
Guillaume Lemaitre
@glemaitre
XGBoost is just the same as HistGradientBoostingClassifier from scikit-learn
but it is slower
In classification, the loss will still be affected by the balancing issue
that's why making an ensemble of GBDTs that each see balanced bootstrap samples could be better
Billel Mokeddem
@mokeddembillel
I see, thank you for the information
Soledad Galli
@solegalli
Hello, what are the stopping criteria for AllKNN? In the comments in the source code I see that 2) one class is disappearing, which makes sense, but I don't understand 1) the number of samples in the other class becomes inferior to the majority class? Wouldn't the majority class already have more examples by definition? https://github.com/scikit-learn-contrib/imbalanced-learn/blob/e7ccf10/imblearn/under_sampling/prototype_selection/edited_nearest_neighbours.py#L583
Soledad Galli
@solegalli
Another question: which one is the official documentation website? https://imbalanced-learn.org or https://imbalanced-learn.readthedocs.io ?
Guillaume Lemaitre
@glemaitre
the first link is the right one
however the second link redirects to the first one
Priyam Mehta
@prikmm
Hello everyone, I have a usage question which I have posted on SO, here's the link: https://stackoverflow.com/questions/65652054/not-able-to-feed-the-combined-smote-randomundersampler-pipeline-into-the-main. Can someone help me with this? Thank you.
Priyam Mehta
@prikmm
TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. 'Pipeline(steps=[('smote', SMOTE(n_jobs=-1, random_state=42)), ('under', RandomUnderSampler(random_state=42))])' implements both) Also, can someone explain what this error means? The Pipeline only exposes the fit and fit_resample methods; since transform is not implemented, the first condition is not met and the second one about fit_resample is met. So shouldn't this work? Thank you.
Guillaume Lemaitre
@glemaitre
Can you post the entire traceback to check which transformer/resampler is raising the condition?
oh I see
your smote_pipeline implements fit_resample and transform as well
Basically you cannot use an imbalanced-learn pipeline within another pipeline (we did not think about that case) because you have an ambiguity
the pipeline does not know if it should call fit_resample or fit/transform
In your case, you should be able to solve this issue using a flat pipeline
Main_Pipeline = imb_Pipeline([
    ('feature_handler', FeatureTransformer(list(pearson_feature_vector.index))),
    ('smote', SMOTE()),
    ('random_under_sampler', RandomUnderSampler()),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.99)),
    ('model', LogisticRegression(max_iter=1750)),
])
It should be equivalent
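As a self-contained illustration on synthetic data (leaving out the custom FeatureTransformer and PCA steps from the question), a flat imbalanced-learn pipeline fits without the ambiguity:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imb_Pipeline

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

flat_pipeline = imb_Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('under', RandomUnderSampler(random_state=42)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1750)),
])
flat_pipeline.fit(X, y)  # each sampler step is reached via fit_resample only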
Priyam Mehta
@prikmm

Please correct me if my understanding is lacking.
So, when I call fit on Main_Pipeline, since smote_pipeline has a fit present, it is assumed that transform is also present. Actually it doesn't have one; when I tried to call transform I got an error:

AttributeError: 'RandomUnderSampler' object has no attribute 'transform'

Pipeline code:

Smote_Under_pipeline = imb_Pipeline([
    ('smote', SMOTE(random_state=rnd_state, n_jobs=-1)),
    ('under', RandomUnderSampler(random_state=rnd_state)),
])

And accordingly, because of this assumption, both fit/transform and fit_resample become available. This causes ambiguity and the code blows up?

Guillaume Lemaitre
@glemaitre
Yes, the sampler does not implement transform, but the pipeline does
and tries to call the transform of the underlying estimator