Guillaume Lemaitre
@glemaitre
It is possible that they are randomized
mcihat
@mcihat
Ok got it.
What could be the reason that I get the same prediction results on the dataset I mentioned above, no matter what I choose for the parameter "n_estimators"? Choosing 1000 or 1 makes no difference.
Ghost
@ghost~605a144b6da037398476ba75
I'm having a weird problem with RandomOverSampler. I'm running two Python scripts on near-identical data from two different sources and getting a ValueError: could not convert string to float for my one text-based feature. The feature formatting and the number of unique values are the same in both sets, yet one script works and the other raises the error. This apparently was an issue years ago - imbalanced-learn didn't support text in pandas dataframes - but I believe that has since been fixed (evidently, since one of my scripts works). Any guidance on how to handle this? As mentioned, I have confirmed that the formatting and values of the problem feature are the same. Thanks
Guillaume Lemaitre
@glemaitre
I cannot say without a code snippet or the head of the dataset
but normally RandomOverSampler just takes the dataset as-is, and having non-numerical data inside should not be an issue
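A minimal sketch of that claim (column names invented; assumes a recent imbalanced-learn):

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

X = pd.DataFrame({
    'city': ['london', 'paris', 'paris', 'tokyo', 'tokyo', 'tokyo'],  # text feature
    'amount': [5.0, 3.0, 1.0, 2.0, 4.0, 6.0],
})
y = pd.Series([0, 0, 0, 0, 1, 1])

# RandomOverSampler only duplicates existing rows, so string columns pass through untouched
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(X_res['city'].dtype)  # object: still plain strings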
Ghost
@ghost~605a144b6da037398476ba75
Thanks for your reply. I tend to agree that it is something in my data vs. something in the library since it runs properly in one script.
Here is another wildcard question - I'll check myself, but if someone has the answer off the top of their head it will save me some work - when oversampling, are the newly created observations appended to the bottom/end of the dataframe/array, or are they placed adjacent to the observations from which they were created?
There we copy the original dataset
and then append new samples for each class
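A quick way to see that ordering (a sketch; assumes the current fit_resample behavior):

import numpy as np
from imblearn.over_sampling import RandomOverSampler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 0, 1])

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(np.array_equal(X_res[:len(X)], X))  # True: original rows come first
print(y_res)                              # [0 0 0 1 1 1]: new samples at the end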
MariMari7
@MariMari7

I'm using a multiclass dataset (CIC-IDS-2017); the target column is categorical (more than 4 classes), and I used pd.get_dummies for one-hot encoding. The dataset is very imbalanced, and when I tried to oversample it using the SMOTE method, it didn't work. I also tried to put the steps into a pipeline, but the pipeline cannot hold get_dummies, so I replaced it with OneHotEncoder; unfortunately, it is still not working:

X = dataset.drop(columns=['Label'])
y = dataset.Label
steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)
Does anyone have a suggestion?
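Two things worth checking here (a sketch, not a confirmed fix for this dataset): the samplers expect y as a single 1-D column of labels rather than get_dummies output, and for feature matrices mixing categorical and numeric columns imblearn provides SMOTENC, which handles the categorical encoding internally. A self-contained example with invented columns:

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(0)
# hypothetical stand-ins for CIC-IDS-2017 columns
X = pd.DataFrame({
    'protocol': rng.choice(['tcp', 'udp'], size=100),  # categorical feature
    'duration': rng.random(100),                       # numeric feature
})
y = np.array(['BENIGN'] * 80 + ['DoS'] * 12 + ['PortScan'] * 8)  # 1-D string labels

# categorical_features takes the positional indices of the categorical columns
smote_nc = SMOTENC(categorical_features=[0], random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)
print(pd.Series(y_res).value_counts())  # all three classes balanced at 80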

MariMari7
@MariMari7
My correlation matrix did not change after using SMOTE; what could be the cause?
HugoTex98
@HugoTex98
Hello!
I'm using resting-state fMRI correlation matrices, which are 4D, and I want to use SMOTE+ENN, but it only allows me to use 2D data... How can I address this problem without losing information from my original data?
Thanks!
Akilu Rilwan Muhammad
@arilwan

This question has to do with the SMOTEBoost implementation found here: https://github.com/gkapatai/MaatPy, but I believe the issue is related to the imblearn library.

I tried using the library to re-sample all classes in a multiclass problem and got an AttributeError: 'int' object has no attribute 'flatten':

How to reproduce (in Colab nb):
Clone repo:

!git clone https://github.com/gkapatai/MaatPy.git
cd MaatPy/

from maatpy.classifiers import SMOTEBoost

Dummy data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, weights=[.1, .15, .75])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, random_state=123)

And then:

from maatpy.classifiers import SMOTEBoost
model = SMOTEBoost()
model.fit(xtrain, ytrain)

/usr/local/lib/python3.7/dist-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    106         random_state = check_random_state(self.random_state)
    107         samples_indices = random_state.randint(
--> 108             low=0, high=len(nn_num.flatten()), size=n_samples)
    109         steps = step_size * random_state.uniform(size=n_samples)
    110         rows = np.floor_divide(samples_indices, nn_num.shape[1])

AttributeError: 'int' object has no attribute 'flatten'
krinetic1234
@krinetic1234
hi, i have a question
when using SMOTE, i get this ValueError: Found array with dim 4. Estimator expected <= 2.
it's a binary-class problem
with a CNN
not sure how to fix it
please help
HugoTex98
@HugoTex98
@krinetic1234 as far as I know, SMOTE only works on 2D data... I have the same problem and I don't know how to solve it
krinetic1234
@krinetic1234
interesting, yeah, I have a CSV of grayscale images, basically, where each image is 224 by 224
so in that case it wouldn't work...
is there an alternative to SMOTE that works well for images?
apparently you just flatten each image into one long row
and then reshape it back afterwards
idk how well it'll work though
MariMari7
@MariMari7

I used SMOTEENN and SMOTETomek on my initial data, and they took between 1.5 and 2.5 hours. But after I added some data, they ran for 5 hours before I interrupted them.

  • Initial data: 49.77 MB
  • Added data: 79.25 MB

  • All data: 129.02 MB

NB: plain SMOTE takes just a few seconds on all the data.

HugoTex98
@HugoTex98
Really interesting @krinetic1234... but won't using that reshape cause a loss of information?
krinetic1234
@krinetic1234
i thought so too... do any of you know of a better way
to do this "SMOTE" idea for images?
and btw i tried the reshape and it didn't really work properly
Akilu Rilwan Muhammad
@arilwan

You just need to do the reshaping properly. I once worked with time-series activity data in which I created chunks of N-sized time steps. The shape of each input was (1, 100, 4), so the training set had shape (n_samples, 1, 100, 4), and it was a five-class, multi-minority problem that I wanted to oversample using SMOTE.

The way I went about it was to flatten the input, like so:

# reshape (flatten) Train_X for SMOTE resampling
nsamples, k, nx, ny = Train_X.shape
Train_X = Train_X.reshape((nsamples, k * nx * ny))

smote = SMOTE(sampling_strategy='not majority', random_state=42, k_neighbors=5)
X_resample, Y_resample = smote.fit_resample(Train_X, Train_Y)

And then reshape the instance back to the original input shape, like so:

# reshape back to the original CNN input structure
X_resample = X_resample.reshape(len(X_resample), k, nx, ny)
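For the 224-by-224 grayscale case mentioned earlier, the same flatten-and-reshape idea end to end (a self-contained sketch with invented shapes and class counts):

import numpy as np
from imblearn.over_sampling import SMOTE

# fake imbalanced image data: 60 "cats" vs 12 "dogs", 224x224 grayscale
rng = np.random.default_rng(0)
X = rng.random((72, 224, 224))
y = np.array([0] * 60 + [1] * 12)

n, h, w = X.shape
X_flat = X.reshape((n, h * w))                 # SMOTE needs 2-D input
X_res, y_res = SMOTE(random_state=0).fit_resample(X_flat, y)
X_res = X_res.reshape((len(X_res), h, w))      # back to image shape
print(X_res.shape, np.bincount(y_res))         # (120, 224, 224) [60 60]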
krinetic1234
@krinetic1234
ok, but does SMOTE actually augment images? @arilwan
like, let's say I have tons of images of cats and few images of dogs: does it actually augment the dog images? and if so, how does it oversample those?
i haven't seen much where people use SMOTE for oversampling images specifically, which is why i'm surprised
thanks by the way, i'll definitely check what you sent
i believe i did something similar but got an error
krinetic1234
@krinetic1234
(screenshot attached: Screen Shot 2021-06-28 at 10.54.48 PM.png)
so i was also wondering something
i tried to do that and it didn't work
do you have any advice on what to do differently?
i didn't do it exactly how you did, but i thought this was a simpler approach, conceptually
Soledad Galli
@solegalli
Regarding the instance hardness threshold: the docs say "InstanceHardnessThreshold is a specific algorithm in which a classifier is trained on the data and the samples with lower probabilities are removed" (https://imbalanced-learn.org/stable/under_sampling.html#instance-hardness-threshold). What probability exactly is it referring to? The probability of the majority class, or the probability of the minority class? It is not clear to me from the docs.
Assuming the target has 2 classes, 0 and 1, and 1 is the minority class: cross_val_predict will return an array with the probabilities of classes 0 and 1. The code then takes the first vector (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/f177b05/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L156), that is, the probability of belonging to the majority class, and keeps the samples with the highest probability - those that are easiest to classify correctly as members of the majority. So far, I think I understand.
But if the target has 3 classes, 0, 1 and 2, and only 2 is the minority, the code (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/f177b05/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L156) will still only take the first vector of probabilities, i.e. that of class 0. For class 1, should it not be taking the second vector? Is this a bug, or am I misreading the code?
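To make the question concrete, here is the cross_val_predict output being discussed (a sketch; the own-class indexing at the end is just one possible reading of "hardness", not a claim about what the imbalanced-learn code does):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           weights=[0.45, 0.45, 0.10], random_state=0)
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                          cv=5, method='predict_proba')
print(proba.shape)  # (300, 3): one probability column per class

# probability assigned to each sample's own class - one candidate notion of hardness
own_class_proba = proba[np.arange(len(y)), y]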
Amila Wickramasinghe
@AmilaSamith
I am trying to use a customized generator, but it raises the following error from the _generator.py file:

try:
    import keras
    ParentClass = keras.utils.Sequence
    HAS_KERAS = True
except ImportError:
    ...

AttributeError: module 'keras.utils' has no attribute 'Sequence'

What can I do to overcome this error?
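One thing worth checking (a guess, not a confirmed fix): which keras module actually gets imported, since the tf.keras bundled with TensorFlow 2.x does expose utils.Sequence:

import tensorflow as tf

print(tf.__version__)
print(hasattr(tf.keras.utils, 'Sequence'))  # True on TF 2.x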
Soledad Galli
@solegalli
In random oversampling, when applying shrinkage we multiply the std of each variable by the shrinkage factor (arbitrary, entered by the user) and by a smoothing_constant. The smoothing_constant is (4 / ((n_features + 2) * n_samples)) ** (1 / (n_features + 4)). What is the logic behind this constant?
Also, as per the docs on RandomOverSampler, "When generating a smoothed bootstrap, this method is also known as Random Over-Sampling Examples (ROSE) [1]." But in the ROSE paper, do the authors not select samples with probability 1/2? In RandomOverSampler the smoothing is applied to all randomly extracted samples, regardless of their original probability.
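For what it's worth, that constant matches Silverman's rule-of-thumb bandwidth for a d-dimensional Gaussian kernel density estimate (an observation about the formula itself, not something the imbalanced-learn docs state):

# Silverman's rule of thumb: h = (4 / ((d + 2) * n)) ** (1 / (d + 4)),
# which equals the smoothing_constant with d = n_features, n = n_samples
def silverman_bandwidth(n_features: int, n_samples: int) -> float:
    return (4 / ((n_features + 2) * n_samples)) ** (1 / (n_features + 4))

print(silverman_bandwidth(n_features=2, n_samples=1000))  # ~0.32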