Guillaume Lemaitre
@glemaitre
this is usually linked to the application, and thus it is not a parameter to be optimized
You can imagine the following with credit-card fraud detection,
in which false positives and false negatives will not have the same cost,
but the cost could be defined as a real cost in dollars
linked to the business side
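The asymmetric-cost idea above can be made concrete by scoring a confusion matrix in dollars. The per-error costs below are purely hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical per-error costs for fraud detection (illustrative numbers,
# not from the discussion): a missed fraud costs far more than a false
# alarm that only triggers a manual review.
COST_FN = 500.0  # missed fraud (false negative)
COST_FP = 5.0    # false alarm (false positive)

y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1, 0])

fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
total_cost = fp * COST_FP + fn * COST_FN  # business cost in dollars
```

With one false positive and one false negative, the business cost is dominated by the missed fraud, which is exactly why the two error types should not be weighted equally.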
György Kovács
@gykovacs
Yep, this makes total sense. On the flip side, instance-level "complicatedness"/"hard-to-learn-ness" usually appears in oversampling techniques, which might serve as the basis of local costs. I found this so obvious that I just wondered if I was searching for the wrong terms to find anything like this.
Also, I could imagine some costs based on local density estimation: if the density is smaller, then the cost of misclassification should be higher.
This would balance differences in the densities of the classes, not only differences in the number of samples,
as global, class-cardinality-proportional weights do.
PanZiwei
@PanZiwei
Hi, is it possible to get the element indices, so that I can tell which data in the SMOTE-upsampled dataset comes from the original dataset?

For reference I am using the SMOTE method for oversampling:

smoter = SMOTE(random_state=42, n_jobs=-1, sampling_strategy='not majority')

X_train_smote, y_train_smote = smoter.fit_resample(X_train, y_train)

To be more specific, I am wondering whether it is possible to know the index for X_train in the X_train_smote dataset.

Guillaume Lemaitre
@glemaitre
Nope, this is currently not possible
Jan Zyśko
@FrugoFruit90
Has anyone used imblearn SMOTE together with Pipeline and some explainability framework? e.g. shap, lime, eli5
My problem is that I am trying to explain my predictions, but for that I first need to transform the data to the state "just before fitting the last estimator" (because transformers in the Pipeline, like custom vectorizers, create new columns) before running the explainability package together with the last step, the estimator. For that, I can in principle use pipeline[:-1].transform(np.array(X_train)). However, I then get the error "AttributeError: 'SMOTE' object has no attribute 'transform'". I don't know how to proceed.
Mustafa Aldemir
@mstfldmr
I get "could not convert string to float" error from SMOTE, but RandomOverSampler works well.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
y = np.array([0] * 10 + [1] * 90)

ros = RandomOverSampler()
x_res, y_res = ros.fit_resample(x, y)

smote = SMOTE()
x_res, y_res = smote.fit_resample(x, y)  # raises: could not convert string to float
Guillaume Lemaitre
@glemaitre
because RandomOverSampler does not use the values of X to oversample,
while SMOTE interpolates, and therefore applies numerical operations
thus you need to encode your data in some way
Soledad Galli
@solegalli

Hello team. I have a question about Borderline SMOTE:
Variant 2 is supposed to interpolate between the minority samples in danger and other neighbours from the minority class, and between the minority samples in danger and some neighbours from the majority class.

In line https://github.com/scikit-learn-contrib/imbalanced-learn/blob/4162d2d/imblearn/over_sampling/_smote.py#L352
we train a KNN only on the minority class and then derive the neighbors nns from it, which we use for the interpolation.

Then we use that nns to obtain the neighbours from the majority class in the second part (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/4162d2d/imblearn/over_sampling/_smote.py#L397) of the borderline-2 code. But wouldn't nns contain only neighbours from the minority class, since it is derived from a KNN trained only on the minority class?

Mustafa Aldemir
@mstfldmr

thus you need to encode your data in some way

I have only 1 column in X and it is text. How should I encode it?

Mustafa Aldemir
@mstfldmr
My data is not categorical, it's description text in free format.
Soledad Galli
@solegalli
Random question: do you have any experience with how widely used MetaCost (https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf) is for working with imbalanced data? Supposedly it wraps any classifier. Do you know a good Python implementation, or are you planning to make it part of imbalanced-learn?
Guillaume Lemaitre
@glemaitre
There is a plan to maybe include it in scikit-learn
Eleni Markou
@emarkou
Hello! Maybe a usage question... I am using SMOTE and trying to oversample the minority class to a specific ratio, so I am passing a float value in (0, 1] to the sampling_strategy argument. No matter the value I set, even 1, I always get "ValueError: The specified ratio required to remove samples from the minority class while trying to generate new samples. Please increase the ratio."
The distribution of my target variable is the following, so I was expecting anything above roughly 0.1 to be fine. Am I missing something?
[attached image: target class distribution]
Christos Aridas
@chkoar
@emarkou could you post an MRE and your versions?
Christos Aridas
@chkoar
In any case, without an MRE, I suppose that your target ratio is too low to generate new examples. In the oversampling case, sampling_strategy is proportional to the majority class. That said, in the case of 90:10 you will need at least 0.13 in order to generate (and add) a single minority instance, so the new ratio will be 90:11.
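The arithmetic behind those numbers, as a sketch (for oversamplers, a float sampling_strategy is the desired minority/majority ratio after resampling):

```python
n_majority, n_minority = 90, 10

current_ratio = n_minority / n_majority         # 10/90 ≈ 0.111
one_new_sample = (n_minority + 1) / n_majority  # 11/90 ≈ 0.122

# Any sampling_strategy below current_ratio would require *removing*
# minority samples, which an over-sampler refuses with the ValueError
# quoted above; anything at or above 11/90 adds at least one sample.
```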
Billel Mokeddem
@mokeddembillel
Hey guys, can anyone here review my issue (scikit-learn-contrib/imbalanced-learn#781)? It's about adding a new feature to Condensed Nearest Neighbour, and I want to start working on it, but first I want to hear your opinion. Thank you!
Guillaume Lemaitre
@glemaitre
The issue is that AllKNN applies ENN several times,
so we can stop after a certain number of iterations based on some criterion,
which could be considered as early stopping.
CNN is only doing a single iteration.
We have an inner iteration that goes through all samples to decide whether or not to exclude some of them,
but stopping there would be a bad idea:
you might then treat only a specific area of your data distribution, which would not be beneficial, I think.
At least this is my intuition.
Probably, tuning the hyperparameters would be better in this case.
Billel Mokeddem
@mokeddembillel
Yes, I think you're right, it would be a bad idea. But even with tuning the hyperparameters it doesn't work in all the situations I tried, so I ended up using RandomUnderSampler and it yielded the best result.
Guillaume Lemaitre
@glemaitre
It depends on what exactly the problem is that you want to solve,
but in my experience, the recipe that works to handle balancing issues is
to train a BalancedBaggingClassifier (which uses a RandomUnderSampler) with a strong learner such as HistGradientBoosting,
and it usually beats any fancy resampling
and it is just much faster :)
Any algorithm based on KNN does not scale properly.
But this is only my 2 cents on the issue. I am happy to see applications where this is indeed not the case :)
Billel Mokeddem
@mokeddembillel
Oh, that sounds cool. Actually, I was going to do something similar: after applying RandomUnderSampler, I was going to use XGBoost, which I believe is somewhat similar to what BalancedBaggingClassifier does. But anyway, I will try your solution. Thanks for helping!
Guillaume Lemaitre
@glemaitre
XGBoost is essentially the same kind of model as HistGradientBoostingClassifier from scikit-learn (both are histogram-based gradient boosting)