Guillaume Lemaitre
@glemaitre
If I recall properly, they leverage sample_weight, and therefore you would need a Sampler that stores indices to build the sample_weight vector.
The second consideration is computational performance:
random under-/over-sampling is not costly,
but adding a sampler based on k-NN will not scale.
In practice, I tend to think that RUS and ROS would be enough to alleviate the issue with an ensemble learner.
Soledad Galli
@solegalli
makes sense, thank you!
Hanchung Lee
@leehanchung

Hi,

I am getting an error when loading a trained imblearn.pipeline Pipeline saved by joblib. This is the error message:

ModuleNotFoundError: No module named 'imblearn.over_sampling._smote.base'; 'imblearn.over_sampling._smote' is not a package

The trained pipeline was saved via joblib.dump(pipeline, 'filename.joblib'). Any tips as to where the saving and loading process went wrong?

Guillaume Lemaitre
@glemaitre
Make sure that the installed imblearn version is the same as the version that was used to pickle the pipeline.
mcihat
@mcihat

Hello everyone.
I have a usage question about EasyEnsembleClassifier. I have a dataset with 450,000 rows and 13 columns (12 features, 1 target). My dataset is imbalanced (1:50), so I decided to use EasyEnsembleClassifier. I realized that all the subsets are exactly the same for all the estimators.
I found this issue which is similar to my problem: scikit-learn-contrib/imbalanced-learn#116
In theory the classifier should create a subset for each estimator. These subsets should contain all minority class samples plus the same number of samples drawn from the majority class. In my case I should have roughly 18,000 samples in each subset (I have roughly 9,000 samples in the minority class). However, when I use the estimators_samples_ attribute, the output arrays for my estimators are exactly the same, and all of them have the size of the complete training set (80% of my dataset). So I decided to make a test:
'''
import numpy as np
from sklearn.datasets import make_classification
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=10, random_state=1)

clf = EasyEnsembleClassifier(n_estimators=5, n_jobs=-1)
clf.fit(X, y)

arr = clf.estimators_samples_
arr

Output:
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
'''

What am I doing wrong here? Obviously I am missing something.

Guillaume Lemaitre
@glemaitre
To check the samples used by each estimator, you should use:
In [13]: for est in clf.estimators_:
    ...:     print(est[0].sample_indices_)
[4 6 7 0 9 2]
[4 6 7 5 8 3]
[4 6 7 1 2 5]
[4 6 7 3 1 5]
[4 6 7 3 5 2]
I am not sure what estimators_samples_ is reporting. It might be a bug then.
It is weird that we don't document it.
Oh, I see, we should add it to the documentation.
Guillaume Lemaitre
@glemaitre
estimators_samples_ gives the samples dispatched to each inner estimator, which will later undersample them.
This attribute exists because we inherit from scikit-learn's BaggingClassifier.
mcihat
@mcihat

The code you provided works fine with my generated dataset, but when I use it on my real dataset this is what I get:

clf = EasyEnsembleClassifier(n_estimators=5, n_jobs=-1, sampling_strategy = 1.0)
clf.fit(X_train, y_train)
for est in clf.estimators_:
    print(est[0].sample_indices_)

Output:
[279507 240017  23859 ...  94249  87790 120830]
[277730  75855  70104 ... 341432 318980 130029]
[166614    207  72374 ...  93568  76905 142951]
[304630  28272 143132 ... 159062 264981  41332]
[ 35943 358917  68200 ... 121931 209190 284075]

Is this a normal result? I would expect the first three indices in each row to be the same, since all of the samples that belong to the minority class are used in every subset. I am not saying this is wrong; I am just asking whether this is normal.

mcihat
@mcihat
The code runs just fine now. Thanks for the help @glemaitre
Guillaume Lemaitre
@glemaitre
It is possible that they are randomized
mcihat
@mcihat
Ok got it.
What could be the reason that I get the same prediction results for my dataset (mentioned above) no matter what value I choose for the n_estimators parameter? Choosing 1000 or 1 makes no difference.
Ghost
@ghost~605a144b6da037398476ba75
I'm having a weird problem with RandomOverSampler. I am running two Python scripts on nearly identical data from two different sources and getting a ValueError: could not convert string to float for my one text-based feature. The feature formatting and the number of unique values are the same in both sets. In one script it works and in the other I get the error. This apparently was an issue years ago: imbalanced-learn didn't support text in pandas dataframes. I believe that has since been fixed (evidently, since one of my scripts works). Any guidance on how to handle this? As mentioned, I have confirmed that the formatting and values of the problem feature are the same. Thanks
Guillaume Lemaitre
@glemaitre
I cannot say without a code snippet or the head of the dataset,
but normally the RandomOverSampler takes the dataset as-is, and having non-numerical data inside is not an issue.
Ghost
@ghost~605a144b6da037398476ba75
Thanks for your reply. I tend to agree that it is something in my data rather than in the library, since it runs properly in one script.
Here is another wildcard question (I'll check myself, but if someone knows the answer off the top of their head it will save me some work): when oversampling creates new observations, are they appended to the bottom/end of the dataframe/array, or placed adjacent to the observation from which they were created?
There we copy the original dataset
and then append new samples for each class
MariMari7
@MariMari7

I'm using a multiclass dataset (CIC-IDS-2017); the target column is categorical (more than 4 classes), and I used pd.get_dummies for one-hot encoding. The dataset is very imbalanced, and when I tried to oversample it using the SMOTE method, it doesn't work. I also tried to put the steps into a pipeline, but the pipeline cannot support get_dummies, so I replaced it with OneHotEncoder; unfortunately, it is still not working:

X = dataset.drop(['Label'], axis=1)
y = dataset.Label
steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)
Is there any suggestion?

MariMari7
@MariMari7
My correlation matrix did not change after using SMOTE; what could be the cause?
HugoTex98
@HugoTex98
Hello!
I'm using resting-state fMRI correlation matrices, which are 4D, and I want to use SMOTE+ENN, but it only allows me to use 2D data... How can I address this problem without losing information from my original data?
Thanks!
Akilu Rilwan Muhammad
@arilwan

This question has to do with the SMOTEBoost implementation found here: https://github.com/gkapatai/MaatPy, but I believe the issue is related to the imblearn library.

I tried using the library to re-sample all classes in a multiclass problem and was caught by the error AttributeError: 'int' object has no attribute 'flatten':

How to reproduce (in a Colab notebook):
Clone the repo:

!git clone https://github.com/gkapatai/MaatPy.git
cd MaatPy/

Dummy data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, weights=[.1, .15, .75])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, random_state=123)

And then:

from maatpy.classifiers import SMOTEBoost
model = SMOTEBoost()
model.fit(xtrain, ytrain)

/usr/local/lib/python3.7/dist-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    106         random_state = check_random_state(self.random_state)
    107         samples_indices = random_state.randint(
--> 108             low=0, high=len(nn_num.flatten()), size=n_samples)
    109         steps = step_size * random_state.uniform(size=n_samples)
    110         rows = np.floor_divide(samples_indices, nn_num.shape[1])

AttributeError: 'int' object has no attribute 'flatten'
krinetic1234
@krinetic1234
hi, I have a question:
when using SMOTE, I get this ValueError: Found array with dim 4. Estimator expected <= 2.
it's a binary-class problem
with a CNN
not sure how to fix it
please help
HugoTex98
@HugoTex98
@krinetic1234 as far as I know, SMOTE only works on 2D data... I have the same problem and I don't know how to solve it
krinetic1234
@krinetic1234
interesting, yeah, I have a CSV of grayscale images, basically, where each image is 224 by 224
so in that case it wouldn't work...
is there an alternative to SMOTE that works well for images?
apparently you just flatten the image data (multiply out the dimensions)
and then reshape it back
I don't know how well it'll work though
MariMari7
@MariMari7

I used SMOTEENN and SMOTETomek on my initial data; they took between 1.5 and 2.5 hours. But when I added some data, they ran for 5 hours before I interrupted them.

  • Initial data: 49.77 MB
  • Added data: 79.25 MB

  • All data: 129.02 MB

NB: SMOTE takes just a few seconds on all the data.

HugoTex98
@HugoTex98
Really interesting @krinetic1234... but won't using that reshape cause a loss of information?
krinetic1234
@krinetic1234
I thought so too... do any of you know of a better way?