Guillaume Lemaitre
@glemaitre
so you could find these rows by knowing the original size of X
player1024
@player1024
@glemaitre thank you, works like a charm
albattawi
@albattawi

Hello everyone, please I need help with TypeError: __init__() got an unexpected keyword argument 'random_state' when I try to import imblearn. I installed it with both pip and conda, and I am using Python 3.8.

>>> import imblearn
>>> print(imblearn.__version__)
0.7.0

Christian Hacker
@christianhacker

Greetings. I'm working with some keras autoencoder models, and would like to use the imblearn keras batch generator with them. But imblearn samplers only work with targets that are either class labels or single-output continuous (regression); you get an error if you pass targets that are multi-output continuous. My datasets have class labels, but with autoencoders the targets are supposed to be the same input data that you pass to the model. A keras model function call would look like:

model.fit(x=X, y=X, ...)

The imblearn.keras batch generator can't do this. It doesn't seem like it would be too difficult, though; you still would need the class labels for the sampling strategy to work, but instead of passing the labels to the model, you just pass the input features as the targets as well.

Anyone have ideas on how to get this to work? Thank you.
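
One way to sketch this idea (assuming the imblearn.keras API where balanced_batch_generator(X, y, ...) returns a generator of (X_batch, y_batch) pairs plus a steps_per_epoch count): wrap the generator so it yields the batch features as both inputs and targets. The class labels are still used by the sampler to balance the batches, they are just never passed to the model. The helper name below is made up for illustration.

import numpy as np
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import RandomUnderSampler

def autoencoder_batches(X, y, batch_size=32, random_state=0):
    # balanced batches are drawn using the class labels y ...
    gen, steps = balanced_batch_generator(
        X, y,
        sampler=RandomUnderSampler(random_state=random_state),
        batch_size=batch_size,
        random_state=random_state,
    )
    # ... but the model only ever sees X_batch, as both input and target
    def wrapped():
        for X_batch, _ in gen:
            yield X_batch, X_batch
    return wrapped(), steps

# usage with a compiled keras autoencoder (untested sketch):
# gen, steps = autoencoder_batches(X_train, y_train)
# model.fit(gen, steps_per_epoch=steps, epochs=10)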

thrylos2307
@thrylos2307
I am dealing with multiclass target values and I got ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict while implementing RandomOverSampler. It says to use a dict, but what kind of dictionary is it referring to? I mean, what keys and values should the dict consist of?
Andrea Lorenzon
@andrealorenzon
the dict should have keys = the classes and values = the required # of samples
{1: 100, 2: 100}
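
For illustration, a minimal sketch of the dict form of sampling_strategy for a made-up three-class problem (the labels and counts here are invented):

import numpy as np
from imblearn.over_sampling import RandomOverSampler

X = np.random.rand(300, 4)
y = np.array([0] * 200 + [1] * 70 + [2] * 30)   # imbalanced 3-class target

# keys are the class labels, values are the desired number of samples
# for each class AFTER resampling (for an over-sampler they cannot be
# lower than the original class counts)
strategy = {0: 200, 1: 150, 2: 150}

ros = RandomOverSampler(sampling_strategy=strategy, random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y_res))   # [200 150 150]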
György Kovács
@gykovacs
Hi All, cost-sensitive learning is a large portion of all the imbalanced learning material. I'm into cost-sensitive learning with instance-dependent cost matrices, but couldn't find a single paper about it. Everyone supposes that the instance-level cost matrices are given. Have you come across any papers or methods estimating instance-level cost matrices?
Guillaume Lemaitre
@glemaitre
usually the user defines the costs
this is usually linked to the application and thus it is not a parameter to be optimized
You can imagine the following with credit-card fraud detection
in which a false positive and a false negative will not have the same cost
but the cost could be defined as a real cost in dollars
linked to the business side
György Kovács
@gykovacs
Yep, this makes total sense. On the flip side, instance-level "complicatedness"/"hard-to-learn-ness" usually appears in oversampling techniques, and it might serve as the basis of local costs. I found this so obvious that I wondered whether I am just searching for the wrong terms.
Also, I could imagine some costs based on local density estimation: if the density is lower, then the cost of misclassification should be higher.
That would balance for differences in the densities of the classes, not only for the difference in the number of samples,
as global, class-cardinality-proportional weights do.
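
Purely as an illustration of that density idea (this is not an imblearn feature, and the helper below is invented): estimate a local density proxy from k-nearest-neighbour distances and turn it into per-instance costs, so that points in sparse regions get a higher misclassification cost.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_based_costs(X, n_neighbors=5, eps=1e-12):
    # average distance to the k nearest neighbours (excluding the point itself)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    mean_dist = dist[:, 1:].mean(axis=1)
    density = 1.0 / (mean_dist + eps)        # crude local density proxy
    costs = density.max() / density          # sparse regions -> larger cost
    return costs / costs.mean()              # normalise around 1

# such costs could then be passed, e.g., as sample_weight to an estimator's fit()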
PanZiwei
@PanZiwei
Hi, is it possible to get the element indices so that I can know which data comes from the original dataset in the SMOTE-upsampled dataset?

For reference I am using the SMOTE method for oversampling:

smoter = SMOTE(random_state=42, n_jobs=-1, sampling_strategy = 'not majority')

X_train_smote, y_train_smote = smoter.fit_resample(X_train, y_train)

To be more specific, I am wondering whether it is possible to know the index for X_train in the X_train_smote dataset.

Guillaume Lemaitre
@glemaitre
Nope, this is currently not possible.
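
A partial workaround, following the earlier hint about knowing the original size of X: assuming the resampled arrays stack the original samples first and append the synthetic ones after them (current behaviour, but not a documented guarantee), the original rows can be recovered by slicing. X_train and y_train are the arrays from the snippet above.

from imblearn.over_sampling import SMOTE

smoter = SMOTE(random_state=42, sampling_strategy='not majority')
X_train_smote, y_train_smote = smoter.fit_resample(X_train, y_train)

n_orig = len(X_train)
X_original_part = X_train_smote[:n_orig]    # assumed: the original samples
X_synthetic_part = X_train_smote[n_orig:]   # assumed: the SMOTE-generated samples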
Jan Zyśko
@FrugoFruit90
Has anyone used imblearn SMOTE together with Pipeline and some explainability framework? e.g. shap, lime, eli5
My problem is that I try to explain my predictions, but for that I need to first transform the data to the state of "just before fitting the last estimator" (because the transformers in the Pipeline like custom vectorizers create new columns) before running the explainability package together with the last step - the estimator. For that, I can in principle use pipeline[:-1].transform(np.array(X_train)). However, I then get the error "AttributeError: 'SMOTE' object has no attribute 'transform'". I don't know how to proceed.
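
One possible workaround (a sketch, not an official imblearn API; the helper name is invented): apply only the steps that actually implement transform, and skip sampler steps such as SMOTE, which only act during fit and expose fit_resample instead of transform.

def transform_up_to_estimator(pipeline, X):
    # run every transformer step of an already fitted pipeline, skipping
    # samplers, to reach the "just before the final estimator" state
    Xt = X
    for name, step in pipeline.steps[:-1]:   # everything before the final estimator
        if hasattr(step, "transform"):       # ordinary transformers
            Xt = step.transform(Xt)
        # samplers (SMOTE, RandomUnderSampler, ...) have no transform and
        # only resample during fit, so they are simply skipped here
    return Xt

# X_before_clf = transform_up_to_estimator(pipeline, np.array(X_train))
# then run shap/lime/eli5 on pipeline.steps[-1][1] with X_before_clf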
Mustafa Aldemir
@mstfldmr
I get "could not convert string to float" error from SMOTE, but RandomOverSampler works well.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
y = np.array([0] * 10 + [1] * 90)

ros = RandomOverSampler()
x_res, y_res = ros.fit_resample(x, y)

smote = SMOTE()
x_res, y_res = smote.fit_resample(x, y)
Guillaume Lemaitre
@glemaitre
because RandomOverSampler does not use the values in X to over-sample, it only duplicates rows
while SMOTE interpolates and therefore applies numerical operations
thus you need to encode your data in some way
Soledad Galli
@solegalli

Hello team. I have a question about Borderline SMOTE:
Variant 2 is supposed to interpolate between the in-danger minority samples and other neighbors from the minority class, and also between the in-danger minority samples and some neighbors from the majority class.

In line https://github.com/scikit-learn-contrib/imbalanced-learn/blob/4162d2d/imblearn/over_sampling/_smote.py#L352
we train a KNN only on the minority class and then derive the neighbors nns from it, which we use for the interpolation.

Then we use those nns to obtain the neighbors from the majority class in the second part (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/4162d2d/imblearn/over_sampling/_smote.py#L397) of the borderline-2 code. But wouldn't nns contain only neighbours from the minority class, since it is derived from a KNN trained only on the minority class?

Mustafa Aldemir
@mstfldmr

thus you need to encode your data in some way

I have only 1 column in X and it is text. How should I encode it?

Mustafa Aldemir
@mstfldmr
My data is not categorical, it's free-form description text
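
Along the lines of "encode your data in some way", one possible sketch for free-form text (just an illustration, not a recommendation from the thread): vectorize the text first, e.g. with TF-IDF, so that SMOTE has numeric features to interpolate; alternatively keep RandomOverSampler, which only duplicates rows and therefore works on raw text.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

texts = np.array(['aaa'] * 10 + ['bbb'] * 90)      # toy free-text column
y = np.array([0] * 10 + [1] * 90)

X_tfidf = TfidfVectorizer().fit_transform(texts)   # sparse numeric matrix

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tfidf, y)
print(X_res.shape, np.bincount(y_res))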
Soledad Galli
@solegalli
Random question: do you have any experience with how widely used MetaCost is for working with imbalanced data? https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf Supposedly it wraps any classifier. Do you know a good Python implementation, or are you planning to make it part of imbalanced-learn?
Guillaume Lemaitre
@glemaitre
There is a plan to maybe include it in scikit-learn
Eleni Markou
@emarkou
Hello! Maybe a usage question... I am using SMOTE and trying to oversample the minority class to a specific ratio. So I am passing a float value between (0,1] to the sampling_strategy argument. No matter the value I set, even 1, I always get "ValueError: The specified ratio required to remove samples from the minority class while trying to generate new samples. Please increase the ratio."
The distribution of my target variable is shown below, so I was expecting anything above roughly 0.1 to be fine. Am I missing something?
[attached image: distribution of the target variable]
Christos Aridas
@chkoar
@emarkou could you post an MRE and your versions?
Christos Aridas
@chkoar
In any case, without an MRE, I suppose that your target ratio is too low to generate new examples. In the over-sampling case, the sampling_strategy is proportional to the majority class. Having said that, in the case of 90:10 you will need at least 0.13 in order to generate (and add) a single minority instance, so the new ratio will be 90:11.
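
To make that arithmetic concrete, a small sketch (this assumes imblearn computes the number of new minority samples roughly as int(sampling_strategy * n_majority) - n_minority, with a negative result triggering the error quoted above):

n_majority, n_minority = 90, 10

for ratio in (0.10, 0.12, 0.13, 0.5, 1.0):
    n_new = int(ratio * n_majority) - n_minority
    print(f"ratio={ratio:0.2f}: {n_new} new minority samples")

# ratio=0.13 is the first of these values that actually adds a sample
# (90:11), matching the explanation above; 0.10 would be rejected.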
Billel Mokeddem
@mokeddembillel
Hey guys, could anyone here review my issue (scikit-learn-contrib/imbalanced-learn#781)? It's about adding a new feature to Condensed Nearest Neighbour. I want to start working on it, but first I'd like to hear your opinion. Thank you.
Guillaume Lemaitre
@glemaitre
The issue is that AllKNN applies ENN several times
so we can stop after a certain number of iterations based on some criterion
which could be considered as early stopping
CNN only does a single iteration
We have an inner loop that goes through all samples to decide whether or not to exclude some of them
but stopping there would be a bad idea
You might then treat only a specific area of your data distribution, which would not be beneficial I think
at least this is my intuition
Probably, tuning the hyperparameters would be better in this case.
Billel Mokeddem
@mokeddembillel
Yes, I think you're right, it would be a bad idea. But even with tuning the hyperparameters, it doesn't work in all the situations I tried, so I ended up using RandomUnderSampler and it yielded the best results.