krinetic1234
@krinetic1234
i thought so too.... do any of you know of a better way to do this "SMOTE" idea for images?
and btw I tried the reshape and it didn't really work properly
Akilu Rilwan Muhammad
@arilwan

You just need to apply the proper reshaping. I once worked with time-series activity data in which I created chunks of N time-steps. The shape of each input was (1, 100, 4), so the training samples had shape (n_samples, 1, 100, 4). It was a five-class problem with multiple minority classes that I wanted to oversample using SMOTE.

The way I went about it was to flatten the input, like so:

from imblearn.over_sampling import SMOTE

# ..reshape (flatten) Train_X for SMOTE resampling
nsamples, k, nx, ny = Train_X.shape
Train_X = Train_X.reshape((nsamples, k * nx * ny))

smote = SMOTE(sampling_strategy='not majority', random_state=42, k_neighbors=5)
X_resample, Y_resample = smote.fit_resample(Train_X, Train_Y)

And then reshape the instance back to the original input shape, like so:

# ..reshape the resampled data back to the original CNN input shape
X_resample = X_resample.reshape(len(X_resample), k, nx, ny)
krinetic1234
@krinetic1234
ok but does SMOTE actually augment images? @arilwan
like let's say I have tons of images of cats and few images of dogs, does it actually augment the dog images? and if so, how does it oversample those?
I haven't seen many places where people use SMOTE for oversampling images specifically, which is why I'm surprised
thanks by the way, I'll definitely check what you sent
I believe I did something similar but got an error
krinetic1234
@krinetic1234
(screenshot attached: Screen Shot 2021-06-28 at 10.54.48 PM.png)
so i was also wondering something
i tried to do that and it didn't work
do you have any advice on what to do differently?
I didn't do it exactly how you did, but thought this was a simpler approach conceptually
Soledad Galli
@solegalli
About the instance hardness threshold: when the docs say "InstanceHardnessThreshold is a specific algorithm in which a classifier is trained on the data and the samples with lower probabilities are removed" (https://imbalanced-learn.org/stable/under_sampling.html#instance-hardness-threshold), what probability exactly is it referring to? The probability of the majority class, or the probability of the minority class? It is not clear to me from the docs.
Assuming that the target has 2 classes, 0 and 1, and 1 is the minority class: cross_val_predict will return an array with the probabilities of class 0 and 1. Then the code takes the first vector (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/f177b05/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L156), that is, the probability of belonging to the majority class, and keeps the samples with the highest probability, i.e. those that are easiest to classify correctly as members of the majority. So far, I think I understand.
But if the target has 3 classes, 0, 1 and 2, and only 2 is the minority, the code (https://github.com/scikit-learn-contrib/imbalanced-learn/blob/f177b05/imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py#L156) will still only take the first vector of probabilities, that is, the one for class 0. But for class 1, should it not take the second vector? Is this a bug, or am I misreading the code?
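To make the question concrete, here is a tiny sketch with made-up numbers (not the actual imblearn code) of the two possible readings:

import numpy as np

# hypothetical output of cross_val_predict(..., method="predict_proba") for 3 classes
probas = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5]])
y = np.array([0, 1, 2])

# reading A: what the linked line seems to do, always take the column of class 0
prob_first_class = probas[:, 0]

# reading B: what I would expect, each sample's probability of its own class
prob_own_class = probas[np.arange(len(y)), y]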
Amila Wickramasinghe
@AmilaSamith
I am trying to use a custom generator, but I get the following error from this block in the _generator.py file:

try:
    import keras
    ParentClass = keras.utils.Sequence
    HAS_KERAS = True
except ImportError:
    ...

AttributeError: module 'keras.utils' has no attribute 'Sequence'

What can I do to overcome this error?
Soledad Galli
@solegalli
In Random Oversampling, when applying shrinkage, we multiply the std of each variable by the shrinkage (arbitrary and entered by the user) and by a smoothing_constant. The smoothing_constant is (4 / ((n_features + 2) * n_samples)) ** (1 / (n_features + 4)). What is the logic behind this constant?
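Written out as code, with made-up values for n_samples and n_features, the constant I am referring to is:

# the smoothing constant from the shrinkage question, with arbitrary example sizes
n_samples, n_features = 1000, 10
smoothing_constant = (4 / ((n_features + 2) * n_samples)) ** (1 / (n_features + 4))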
Also, as per the docs on RandomOverSampler, "When generating a smoothed bootstrap, this method is also known as Random Over-Sampling Examples (ROSE) [1]." But in the ROSE paper, do the authors not select the samples with probability 1/2? In RandomOverSampler the smoothing is applied to all randomly drawn samples, regardless of their original probability.
James Proctor
@j-proc
Is there a simpler/built-in way to convert/cast imblearn objects to their sklearn base/equivalent? My workaround is to create the base from all matching values in the imblearn object's dict and then, once created, update the sklearn object's dict with the imblearn object's dict.
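Roughly, my workaround looks like this sketch (the BalancedRandomForestClassifier / RandomForestClassifier pair is just an example pair picked for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedRandomForestClassifier

X, y = make_classification(weights=[0.9, 0.1], random_state=0)
imb_clf = BalancedRandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# build the sklearn base with the parameters the two classes share
base_params = RandomForestClassifier().get_params()
shared = {k: v for k, v in imb_clf.get_params().items() if k in base_params}
sk_clf = RandomForestClassifier(**shared)

# then overwrite the sklearn object's dict with the fitted imblearn attributes
sk_clf.__dict__.update(imb_clf.__dict__)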
Guillaume Lemaitre
@glemaitre
imblearn objects inherit from scikit-learn's BaseEstimator, so I am not sure what you mean by converting them?
Which type of operation would you like to apply that is a blocker?
James Proctor
@j-proc
I actually just needed to convert/downcast the type for compatibility with an external package that only supports sklearn objects. I wouldn't be calling fit, which is where I expect the major changes occur, but I thought it might be possible in a less hacky way.
Guillaume Lemaitre
@glemaitre
But what is a sklearn object? Supposedly sklearn just provides the BaseEstimator class. Which check is done in the external package?
Dennis
@ydennisy
Hello All!
Quick question - what is the recommended way to grid search all samplers?
Guillaume Lemaitre
@glemaitre
If you are using a Pipeline, then you can try different samplers in a scikit-learn grid-search or randomized search.
I would probably search over the sampling_strategy parameter as well.
Dennis
@ydennisy
Thanks @glemaitre, any refs to get started? Using a pipeline I normally search the various params for each step, but how do I switch out different samplers at a given step?
Guillaume Lemaitre
@glemaitre
let me show a bit of code with only scikit-learn estimators, and then I will mention the difference with samplers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])
Here you can define a pipeline with a preprocessor step.
Then you can declare a list of potential preprocessors:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer


all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox"),
]
and then define the parameter grid with
param_grid = {
    "preprocessor": all_preprocessors,
}
and create the corresponding grid-search
from sklearn.model_selection import GridSearchCV

search_cv = GridSearchCV(model, param_grid=param_grid)
and it will try all the preprocessors
So now you can use the same scheme with samplers:
you only need to use the Pipeline from imblearn.pipeline so that it can handle samplers,
and declare a list of all the potential samplers to try and pass it to the grid, as sketched below.
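For example, something along these lines (the specific samplers here are only ones I picked to illustrate, not a recommendation):

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


# same pipeline as before, with a sampler step in the middle
model = ImbPipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("sampler", RandomOverSampler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

all_samplers = [
    None,
    RandomOverSampler(random_state=0),
    RandomUnderSampler(random_state=0),
    SMOTE(random_state=0),
]

param_grid = {
    "sampler": all_samplers,
}

search_cv = GridSearchCV(model, param_grid=param_grid)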
Dennis
@ydennisy
Wow!
That is neat! Would be great to have that on the docs :)
Thanks @glemaitre
Guillaume Lemaitre
@glemaitre
but I don't remember if we have it documented anywhere
Dennis
@ydennisy
Thanks for your help and the MOOC link, not seen that before!
Guillaume Lemaitre
@glemaitre
Regarding the MOOC, we will have an open session at the beginning of next year ;)
Soledad Galli
@solegalli
I made a (Bayesian) search over different over- and under-samplers, cost-sensitive learning, and specific ensemble methods with optuna: https://www.kaggle.com/solegalli/nested-hyperparameter-spaces-with-optuna. Feedback is welcome.
statcom
@statcom
I want to use a custom distance metric with the nearest-neighbour (NN) undersampling methods. For example, KNeighborsClassifier in sklearn has a 'metric' argument to specify your own distance metric between instances, but I couldn't find any way to do that with, for example, CondensedNearestNeighbour or fit_resample.
Guillaume Lemaitre
@glemaitre
n_neighbors accepts an arbitrary scikit-learn KNeighborsClassifier. So you can create a scikit-learn object with the desired metric and plug it into the n_neighbors of CondensedNearestNeighbour.
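Something along these lines, as a sketch (the toy dataset and the Manhattan-style metric are just placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour


def my_metric(a, b):
    # any callable distance works here; this one is just an example (Manhattan distance)
    return np.sum(np.abs(a - b))


X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# KNeighborsClassifier configured with the custom metric, plugged into the sampler
knn = KNeighborsClassifier(n_neighbors=1, metric=my_metric)
cnn = CondensedNearestNeighbour(n_neighbors=knn, random_state=0)
X_res, y_res = cnn.fit_resample(X, y)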