imbalanced-learn is a Python module to perform under-sampling and over-sampling with various techniques. Each sampler accepts a sampling_strategy parameter as well, to control how much each class is resampled.
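As a quick illustration, here is a minimal sketch of sampling_strategy with a random under-sampler, assuming the standard imbalanced-learn API (the class counts are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Build an imbalanced toy problem (roughly 9:1).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print('Original dataset shape %s' % Counter(y))

# For under-sampling, a float sampling_strategy is the desired
# minority/majority ratio after resampling: 0.5 keeps twice as many
# majority samples as minority samples.
sampler = RandomUnderSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))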
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Chain the scaler and the classifier into a single estimator.
model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])
The preprocessor step can then be tuned like any other hyperparameter, by searching over a list of candidate transformers:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox"),
]
from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessor": all_preprocessors,
}
search_cv = GridSearchCV(model, param_grid=param_grid)
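The search is then fitted like any estimator. A hypothetical usage sketch, where X and y are placeholders for your own training data:

search_cv.fit(X, y)
# Inspect which preprocessor was selected by cross-validation.
print(search_cv.best_params_['preprocessor'])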
Note that imbalanced-learn ships its own Pipeline in imblearn.pipeline, extended such that it can handle samplers as intermediate steps.
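For example, a sketch of the earlier model with a sampler inserted, assuming RandomUnderSampler as the sampler:

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

model = ImbPipeline(steps=[
    ("preprocessor", StandardScaler()),
    # The sampler resamples the training data during fit only;
    # it is skipped at predict time.
    ("sampler", RandomUnderSampler(random_state=42)),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])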
from collections import Counter
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour

# fetch_mldata was removed from scikit-learn (mldata.org is offline);
# the Pima diabetes dataset is fetched from OpenML instead.
pima = fetch_openml('diabetes', version=1, as_frame=False)
X, y = pima['data'], pima['target']
print('Original dataset shape %s' % Counter(y))
cnn = CondensedNearestNeighbour(
    random_state=42,
    n_neighbors=KNeighborsClassifier(metric="euclidean"),
)
X_res, y_res = cnn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
The important parameter is the n_neighbors of the KNeighborsClassifier(n_neighbors=2, metric="euclidean") passed to the sampler. By default, scikit-learn sets it to 5, while the original CNN uses a 1-NN rule. That explains why you get really different results.
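To mimic the original CNN rule, a sketch along the same lines would pass an explicit 1-NN classifier:

from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors=1 reproduces the 1-NN rule of the original CNN method.
cnn_1nn = CondensedNearestNeighbour(
    random_state=42,
    n_neighbors=KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
)
X_res, y_res = cnn_1nn.fit_resample(X, y)  # X, y as loaded above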