Christos Aridas
@chkoar
Open a new issue with a minimal reproducible example so that we can verify the behaviour.
Ilkin Bayramli
@ibayramli2001
Hi all! I just wanted to ask whether a BalancedRandomForestClassifier object resamples the test examples as well when the predict method is called. My guess is that it doesn't, because BalancedRandomForestClassifier does not have a predict method per se but inherits it from RandomForestClassifier, which does not resample test examples (also, prediction metrics like precision and recall would be affected by it), but I want to clarify it nevertheless.
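A quick introspection sketch to back up the inheritance guess (this only checks the class namespace, it is not official documentation):

from imblearn.ensemble import BalancedRandomForestClassifier

# if 'predict' is absent from the subclass's own namespace, it is the
# unmodified RandomForestClassifier method and performs no resampling
print("predict" in vars(BalancedRandomForestClassifier))  # expected: False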
joeltok
@joeltok

Hey all, I appreciate the great work that has been put in to make this library so easy to use.

Just a question: For SMOTE, when a sample is generated through oversampling, is there a way to link back to the original sample that was used to generate it (for example through some kind of id, or some fields that are guaranteed to be immutable by the over-sampler)?

Anushiya Thevapalan
@anushiya-thevapalan
Hi, I am getting TypeError: __init__() got an unexpected keyword argument 'ratio' when I simply execute the code below: sm = SMOTE(random_state=42, ratio=0.6). Any suggestions for what's going wrong?
Guillaume Lemaitre
@glemaitre
You should not use ratio anymore because it has been deprecated
and removed since version 0.5, I think
instead use sampling_strategy
it is still compatible with a float
@joeltok We don't have this feature implemented
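A minimal sketch of the suggested change for the ratio error above, assuming imbalanced-learn 0.6 or later:

from imblearn.over_sampling import SMOTE

# 'ratio' was deprecated and removed; 'sampling_strategy' accepts
# the same float for binary targets
sm = SMOTE(random_state=42, sampling_strategy=0.6)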
joeltok
@joeltok
Thanks for the reply. It is not a problem anymore; I cloned the repository and made some code changes to shoehorn the "feature" in. Are you guys looking at introducing this as a feature?
Anushiya Thevapalan
@anushiya-thevapalan
Thanks @glemaitre. It works
Guillaume Lemaitre
@glemaitre
@joeltok We might include a parameter which would store the indices of the 2 samples as an attribute and have it set to False by default.
The only thing is that it should be consistent across all the SMOTE variants
joeltok
@joeltok
@glemaitre Thank you.
player1024
@player1024
Hi everyone, does anyone know how to extract just the new synthetic rows in a dataframe after running SMOTE, SMOTETomek, etc.?
player1024
@player1024

Hey all, Appreciating the great work that has been put in to make this library so easy to use.

Just a question: For SMOTE, when a sample is generated through oversampling, is there a way to link back to the original sample that was used to generate it (for example through some kind of id, or some fields that are guaranteed to be immutable by the over-sampler)?

Similarly to what I just asked above (I just saw this): again, how do we know which rows were actually generated by SMOTE, and how can we extract them?

Guillaume Lemaitre
@glemaitre
On the last question, you cannot for the moment
For the first question
the new rows are concatenated at the end of the original X
so you could find these rows by knowing the original size of X
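A minimal sketch of that slicing trick (the dataset here is synthetic, just for illustration):

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(weights=[0.8, 0.2], random_state=0)

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

# the oversampled rows are concatenated after the original X,
# so everything beyond the original length is synthetic
X_synthetic = X_res[len(X):]
y_synthetic = y_res[len(X):]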
player1024
@player1024
@glemaitre thank you, works like a charm
albattawi
@albattawi

Hello everyone, please I need help with TypeError: __init__() got an unexpected keyword argument 'random_state' when I try to import imblearn. I installed it with both pip and conda, and I am using Python 3.8:

>>> import imblearn
>>> print(imblearn.__version__)
0.7.0

Christian Hacker
@christianhacker

Greetings. I'm working with some keras autoencoder models, and would like to use the imblearn keras batch generator with them. But imblearn samplers only work with targets that are either class labels or single-output continuous (regression); you get an error if you pass targets that are multi-output continuous. My datasets have class labels, but with autoencoders the targets are supposed to be the same input data that you pass to the model. A keras model function call would look like:

model.fit(x=X, y=X, ...)

The imblearn.keras batch generator can't do this. It doesn't seem like it would be too difficult, though; you still would need the class labels for the sampling strategy to work, but instead of passing the labels to the model, you just pass the input features as the targets as well.

Anyone have ideas on how to get this to work? Thank you.
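One possible workaround for in-memory data, sketched with a toy Keras autoencoder (the data and architecture here are made up for illustration): resample using the class labels, then feed the resampled features in as their own targets:

import numpy as np
from tensorflow import keras
from imblearn.under_sampling import RandomUnderSampler

X = np.random.rand(200, 8).astype("float32")  # hypothetical features
y = np.array([0] * 20 + [1] * 180)            # imbalanced class labels

# balance using the class labels, even though the model never sees them
rus = RandomUnderSampler(random_state=0)
X_res, _ = rus.fit_resample(X, y)

# a toy autoencoder: the reconstruction target is the input itself
inputs = keras.Input(shape=(8,))
encoded = keras.layers.Dense(3, activation="relu")(inputs)
decoded = keras.layers.Dense(8, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

autoencoder.fit(X_res, X_res, epochs=5, batch_size=32)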

thrylos2307
@thrylos2307
I am dealing with multiclass target values and, while implementing RandomOverSampler, I got ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict. It mentions using a dict, but what kind of dictionary is it referring to, i.e. what kind of keys and values should the dict consist of?
Andrea Lorenzon
@andrealorenzon
the dict should have keys = classes, values = required # of samples
{1:100, 2:100}
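For example, a minimal sketch with made-up class counts (100/40/20):

import numpy as np
from imblearn.over_sampling import RandomOverSampler

X = np.random.rand(160, 2)
y = np.array([0] * 100 + [1] * 40 + [2] * 20)

# keys are class labels, values are the desired number of samples per class;
# classes left out of the dict are not resampled
ros = RandomOverSampler(sampling_strategy={1: 100, 2: 100})
X_res, y_res = ros.fit_resample(X, y)  # y_res now holds 100 samples of each class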
György Kovács
@gykovacs
Hi all, cost-sensitive learning is a large portion of all the imbalanced learning material. I'm into cost-sensitive learning with instance-dependent cost matrices, but couldn't find a single paper about it. Everyone supposes that the instance-level cost matrices are given. Have you come across any papers or methods estimating instance-level cost matrices?
Guillaume Lemaitre
@glemaitre
usually the user is defining the costs
this is usually linked to the application and thus this is not a parameter to be optimized
You can imagine the following with credit-card fraud detection
in which false positives and false negatives will not have the same cost
but the cost could be defined as a real cost in dollars
linked to the business side
György Kovács
@gykovacs
Yep, this makes total sense. On the flip side, instance-level "complicatedness"/"hard-to-learn-ness" usually appears in oversampling techniques, and it might serve as the basis of local costs. I found this so obvious that I just wondered whether I am searching for the wrong terms to find anything like this.
Also, I could imagine some local-density-estimation-based costs. If the density is smaller, then the cost of misclassification should be higher.
In order to balance for differences in the densities of the classes, not only the difference in the number of samples.
As global, class-cardinality-proportional weights do.
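A rough sketch of that density heuristic (not taken from any paper; the bandwidth and the inversion are arbitrary choices): estimate a local density per instance and invert it into a misclassification cost:

import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.rand(100, 2)  # hypothetical training features

# lower local density -> higher misclassification cost
kde = KernelDensity(bandwidth=0.2).fit(X)
log_density = kde.score_samples(X)
costs = np.exp(-log_density)

# such per-instance costs could then be passed as sample_weight to a learner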
PanZiwei
@PanZiwei
Hi, is it possible to get the element indices so that I can know which data comes from the original dataset in the SMOTE-upsampled dataset?

For reference I am using the SMOTE method for oversampling:

smoter = SMOTE(random_state=42, n_jobs=-1, sampling_strategy='not majority')

X_train_smote, y_train_smote = smoter.fit_resample(X_train, y_train)

To be more specific, I am wondering whether it is possible to know the index for X_train in the X_train_smote dataset.

Guillaume Lemaitre
@glemaitre
nope, this is currently not possible
Jan Zyśko
@FrugoFruit90
Has anyone used imblearn SMOTE together with Pipeline and some explainability framework, e.g. shap, lime, or eli5?
My problem is that I am trying to explain my predictions, but for that I first need to transform the data to the state "just before fitting the last estimator" (because the transformers in the Pipeline, like custom vectorizers, create new columns) before running the explainability package together with the last step, the estimator. For that, I can in principle use pipeline[:-1].transform(np.array(X_train)). However, I then get the error AttributeError: 'SMOTE' object has no attribute 'transform'. I don't know how to proceed.
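One workaround sketch: samplers only act during fit, so replay the transformations by hand and skip any step without a transform method (the pipeline below is a made-up stand-in for the real one):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X_train, y_train = make_classification(weights=[0.8, 0.2], random_state=0)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# SMOTE resamples during fit but has no transform, so skip it here
Xt = X_train
for name, step in pipeline.steps[:-1]:
    if hasattr(step, "transform"):
        Xt = step.transform(Xt)
# Xt is now "just before the last estimator", ready for shap/lime/eli5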
Mustafa Aldemir
@mstfldmr
I get "could not convert string to float" error from SMOTE, but RandomOverSampler works well.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

x = np.array([['aaa'] * 100, ['bbb'] * 100]).T
y = np.array([0] * 10 + [1] * 90)

ros = RandomOverSampler()
x_res, y_res = ros.fit_sample(x, y)

smote = SMOTE()
x_res, y_res = smote.fit_sample(x, y)
Guillaume Lemaitre
@glemaitre
because RandomOverSampler does not use the values in X to oversample
while SMOTE interpolates and therefore applies numerical operations
thus you need to encode your data in some way
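A minimal sketch of that encoding step, assuming purely categorical string features (for mixed numerical/categorical data, SMOTENC may be the better fit):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE

x = np.array([['aaa'] * 10 + ['bbb'] * 90]).T
y = np.array([0] * 10 + [1] * 90)

# turn the strings into numbers that SMOTE can interpolate
encoder = OneHotEncoder(sparse=False)  # dense output for SMOTE
x_encoded = encoder.fit_transform(x)

smote = SMOTE()
x_res, y_res = smote.fit_resample(x_encoded, y)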
Soledad Galli
@solegalli

Hello team. I have a question about Borderline-SMOTE:
Variant 2 is supposed to interpolate between the minority samples in danger and other neighbours from the minority class, and between the minority samples in danger and some neighbours from the majority class.

In line https://github.com/scikit-learn-contrib/imbalanced-learn/blob/4162d2d/imblearn/over_sampling/_smote.py#L352
we train a KNN only on the minority class and then derive the neighbours nns from it, which we use for the interpolation.

Then we use that nns to obtain the neighbours from the majority class in the second part (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/4162d2d/imblearn/over_sampling/_smote.py#L397) of the borderline-2 code. But wouldn't nns contain only neighbours from the minority class, as it is derived from a KNN trained only on the minority class?

Mustafa Aldemir
@mstfldmr

thus you need to encode your data in some way

I have only 1 column in X and it is text. How should I encode it?