Guillaume Lemaitre
@glemaitre
imblearn objects inherit from scikit-learn's BaseEstimator, so I am not sure what you mean by converting them.
Which type of operation would you like to apply that is a blocker?
James Proctor
@j-proc
I actually just needed to convert/downcast the type for compatibility with an external package that only supports sklearn objects. I wouldn't be calling fit, which is where I expect the major changes occur, but I thought it might be possible in a less hacky way.
Guillaume Lemaitre
@glemaitre
But what is a sklearn object? Presumably sklearn just provides the BaseEstimator class. Which check is done in the external package?
Dennis
@ydennisy
Hello All!
Quick question - what is the recommended way to grid search all samplers?
Guillaume Lemaitre
@glemaitre
If you are using a Pipeline then you can try different samplers in a scikit-learn grid search or randomized search.
I would probably search over the sampling_strategy parameter as well.
Dennis
@ydennisy
Thanks @glemaitre any refs to get started - using a pipeline normally I search the various params for each step, but how to switch out various samplers at each step?
Guillaume Lemaitre
@glemaitre
Let me show a bit of code with only scikit-learn estimators and then I will mention the difference with samplers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])
Here you can define a pipeline with a preprocessor step.
Then you can declare a list of potential preprocessors:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer


all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox"),
]
and then define the parameter grid with
param_grid = {
    "preprocessor": all_preprocessors,
}
and create the subsequent grid-search
from sklearn.model_selection import GridSearchCV

search_cv = GridSearchCV(model, param_grid=param_grid)
and it will try all preprocessors.
So now you can use the same scheme with the samplers: you only need to use the Pipeline from imblearn.pipeline such that it can handle samplers, and declare a list of all potential samplers to try and pass it to the grid, as sketched below.
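(The classifier and the particular samplers in this sketch are illustrative choices, not part of the original answer.)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline  # handles samplers, unlike sklearn's Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler


model = Pipeline(steps=[
    ("sampler", RandomOverSampler()),
    ("classifier", LogisticRegression()),
])

param_grid = {
    # each candidate sampler is tried in place of the "sampler" step;
    # their sampling_strategy parameters could be added to the grid as well
    "sampler": [RandomOverSampler(), RandomUnderSampler(), SMOTE()],
}

search_cv = GridSearchCV(model, param_grid=param_grid)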
Dennis
@ydennisy
Wow!
That is neat! Would be great to have that on the docs :)
Thanks @glemaitre
Guillaume Lemaitre
@glemaitre
but I don't remember if we have it documented anywhere
Dennis
@ydennisy
Thanks for your help and the MOOC link, not seen that before!
Guillaume Lemaitre
@glemaitre
Regarding the MOOC, we will have an open session at the beginning of next year ;)
Soledad Galli
@solegalli
I made a (Bayesian) search over different over- and under-samplers, cost-sensitive learning and specific ensemble methods with optuna: https://www.kaggle.com/solegalli/nested-hyperparameter-spaces-with-optuna. Feedback is welcome.
statcom
@statcom
I want to use a custom distance metric for undersampling nearest neighbor (NN) methods. For example, KNeighborsClassifier in sklearn has an argument 'metric' to specify your own distance metric between instances. But I couldn't find any way to do that with, for example, CondensedNearestNeighbor or fit_resample.
Guillaume Lemaitre
@glemaitre
n_neighbors accepts an arbitrary scikit-learn KNeighborsClassifier. So you can create a scikit-learn object with the desired metric and plug it into the n_neighbors of CondensedNearestNeighbour.
statcom
@statcom
I am confused. n_neighbors contains the K value for the classifier. What do you mean by "plug it into n_neighbors"? Well, on second thought, I may understand what you meant, but the suggestion sounds like a hack. In that case, I will just change your script to accommodate the metric. Thanks for your answer.
Guillaume Lemaitre
@glemaitre
n_neighbors : int or estimator object, default=None, so you can pass an estimator as:
from collections import Counter
from sklearn.datasets import fetch_openml  # fetch_mldata was removed from scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

# Pima Indians diabetes data, previously fetched with fetch_mldata('diabetes_scale')
X, y = fetch_openml("diabetes", version=1, return_X_y=True, as_frame=False)
print("Original dataset shape %s" % Counter(y))
cnn = CondensedNearestNeighbour(
    random_state=42,
    n_neighbors=KNeighborsClassifier(metric="euclidean"),
)
statcom
@statcom
Thanks a lot. With your answer, I was able to run the example. But I am having an issue with sampling_strategy for the NN methods. If I leave it as the default, sampling changes {0: 500, 1: 268} to {0: 211, 1: 268}, but if I change the option to "all", I get {0: 211, 1: 1}. But I want to maintain the sampling rate 0.5 so that my target sampling would be {0: 250, 1: 134}. Is there any way to do that with NN methods?
Guillaume Lemaitre
@glemaitre
You cannot have the exact ratio that you want.
You might get better control using the n_neighbors of the KNeighborsClassifier(n_neighbors=2, metric="euclidean"). By default, scikit-learn sets it to 5. The original CNN uses a 1-NN rule, which explains why you get really different results.
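For reference, a minimal sketch of that 1-NN configuration; the euclidean metric is just the example metric used above:

from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

# 1-NN rule, as in the original CNN algorithm; the metric is illustrative
cnn = CondensedNearestNeighbour(
    random_state=42,
    n_neighbors=KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
)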
statcom
@statcom
'all' sampling_strategy still produced {0: 145, 1: 1} with your suggestions. Your answer corresponds to what I figured from your code. My ultimate goal with these methods is not balancing data over classes, but selecting instances for a certain accuracy. So I don't need the exact ratio but similar numbers.
Guillaume Lemaitre
@glemaitre
In this case, you might want to grid-search the parameters to find the best imbalance ratio, for instance along the lines of the sketch below.
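A rough sketch of such a search, assuming the sampler sits in an imblearn Pipeline; the downstream classifier and the candidate parameter values are illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import CondensedNearestNeighbour


model = Pipeline(steps=[
    ("sampler", CondensedNearestNeighbour(random_state=42)),
    ("classifier", KNeighborsClassifier()),
])

param_grid = {
    # the internal k-NN rule drives how many samples are kept
    "sampler__n_neighbors": [
        KNeighborsClassifier(n_neighbors=1),
        KNeighborsClassifier(n_neighbors=2),
        KNeighborsClassifier(n_neighbors=5),
    ],
    "sampler__n_seeds_S": [1, 5, 10],
}

search_cv = GridSearchCV(model, param_grid=param_grid)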
statcom
@statcom
Will try. Thanks!
Konrad
@KonuTech
Hello. First of all, sorry if my question is not specific enough. I wonder whether the example from https://imbalanced-learn.org/stable/references/generated/imblearn.keras.BalancedBatchGenerator.html is supposed to work with TensorFlow 2.7.0 and imbalanced-learn 0.8.1. I understand that I have to install TensorFlow to get Keras. Thanks for the reply in advance.
Guillaume Lemaitre
@glemaitre
I am actually trying to fix several issues right now with the CI,
and I will revise the compatibility with tensorflow and keras.
You might want to downgrade tensorflow in the meantime or wait a bit for the upcoming release.
Konrad
@KonuTech
Thanks
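For reference, the usage pattern under discussion looks roughly like the following sketch, based on the linked documentation page; the network and data are placeholders, and whether it runs depends on the TensorFlow/imbalanced-learn versions, which is exactly the compatibility question above:

import numpy as np
from tensorflow import keras
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler

# Placeholder imbalanced data
X = np.random.rand(1000, 20).astype("float32")
y = np.random.binomial(1, 0.1, size=1000)

# Placeholder network
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# The generator resamples each mini-batch so that classes are balanced
training_generator = BalancedBatchGenerator(
    X, y, sampler=RandomUnderSampler(), batch_size=64, random_state=42
)
model.fit(training_generator, epochs=2)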
iyuiasd7
@iyuiasd7
Hello, I tried to oversample with ADASYN, but an exception was thrown ("Not any neigbours belong to the majority class. This case will induce a NaN case with a division by zero. ADASYN is not suited for this specific dataset. Use SMOTE instead.").
Guillaume Lemaitre
@glemaitre
Did you try SMOTE then?
iyuiasd7
@iyuiasd7
Yes, SMOTE is working properly, not sure what caused it
"In fact, ADASYN focuses on generating samples next to the original samples which are wrongly classified using a k-Nearest Neighbors classifier while the basic implementation of SMOTE will not make any distinction between easy and hard samples to be classified using the nearest neighbors rule."
So it seems that the part of the algorithm that tries to find the difficult samples fails on this dataset.
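A small sketch of the fallback discussed above, assuming the error ADASYN raises is a RuntimeError; the dataset here is synthetic and only illustrative:

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN, SMOTE

# Synthetic imbalanced data, only for illustration
X, y = make_classification(
    n_samples=1000, weights=[0.95, 0.05], random_state=42
)

try:
    X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)
except RuntimeError:
    # ADASYN weights each minority sample by how many of its neighbours
    # belong to the majority class; if that count is zero everywhere, the
    # normalisation divides by zero and ADASYN gives up, so fall back to SMOTE.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)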