Python module to perform under sampling and over sampling with various techniques.
RandomOverSampler takes the dataset as-is, so having non-numerical data in it is not an issue.
I'm using a multiclass dataset (CIC-IDS-2017). The target column is categorical (more than 4 classes), and I used pd.get_dummies for one-hot encoding. The dataset is very imbalanced, and when I tried to oversample it with SMOTE, it didn't work. I also tried to put both steps into a pipeline, but the pipeline cannot take get_dummies, so I replaced it with OneHotEncoder. Unfortunately, it still doesn't work:
X = dataset.drop(columns=['Label'])
y = dataset.Label
steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)
Any suggestions?
This question concerns the SMOTEBoost implementation found at https://github.com/gkapatai/MaatPy, but I believe the issue is related to the imblearn library.
I tried using the library to resample all classes in a multiclass problem and got this error: AttributeError: 'int' object has no attribute 'flatten'
How to reproduce (in Colab nb):
Clone repo:
!git clone https://github.com/gkapatai/MaatPy.git
cd MaatPy/
from maatpy.classifiers import SMOTEBoost
Dummy data:
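from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, weights=[.1, .15, .75])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, random_state=123)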
And then:
from maatpy.classifiers import SMOTEBoost
model = SMOTEBoost()
model.fit(xtrain, ytrain)
/usr/local/lib/python3.7/dist-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
106 random_state = check_random_state(self.random_state)
107 samples_indices = random_state.randint(
--> 108 low=0, high=len(nn_num.flatten()), size=n_samples)
109 steps = step_size * random_state.uniform(size=n_samples)
110 rows = np.floor_divide(samples_indices, nn_num.shape[1])
AttributeError: 'int' object has no attribute 'flatten'
You just need to reshape the data properly. I once worked with time-series activity data that I split into chunks of N time steps. Each input had shape (1, 100, 4), so the training set had shape (n_samples, 1, 100, 4). It was a five-class, multi-minority problem that I wanted to oversample using SMOTE.
The way I went about it was to flatten the input, like so:
# Reshape (flatten) Train_X to 2D for SMOTE resampling
from imblearn.over_sampling import SMOTE
nsamples, k, nx, ny = Train_X.shape
Train_X = Train_X.reshape((nsamples, k * nx * ny))
smote = SMOTE(sampling_strategy='not majority', random_state=42, k_neighbors=5)
X_resample, Y_resample = smote.fit_resample(Train_X, Train_Y)
And then reshape the instances back to the original input shape, like so:
# Reshape the input back to the CNN structure
X_resample = X_resample.reshape(len(X_resample), k, nx, ny)
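The flatten-then-restore steps above can be wrapped in a small helper. This is a minimal sketch, not part of imblearn: `resample_nd` and `DuplicateMinority` are hypothetical names, and `DuplicateMinority` is only a stand-in sampler (it copies rows and does not interpolate like SMOTE) so that the snippet runs with NumPy alone. The array sizes and labels are made up.

```python
import numpy as np


def resample_nd(X, y, sampler):
    """Flatten X to 2D, resample, then restore the trailing dimensions.

    `sampler` is any object exposing fit_resample(X, y), which is the
    interface imblearn samplers such as SMOTE provide.
    """
    X = np.asarray(X)
    trailing_shape = X.shape[1:]            # e.g. (1, 100, 4)
    X_flat = X.reshape(len(X), -1)          # (n_samples, 1 * 100 * 4)
    X_res, y_res = sampler.fit_resample(X_flat, y)
    return X_res.reshape((len(X_res),) + trailing_shape), y_res


class DuplicateMinority:
    """Hypothetical stand-in sampler (NOT SMOTE): repeats rows of every
    smaller class until each class is as large as the biggest one."""

    def fit_resample(self, X, y):
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        target = int(counts.max())
        idx = []
        for c in classes:
            members = np.flatnonzero(y == c)
            # np.resize cycles through the members until `target` are picked.
            idx.extend(np.resize(members, target))
        idx = np.asarray(idx)
        return X[idx], y[idx]


rng = np.random.default_rng(0)
Train_X = rng.normal(size=(12, 1, 100, 4))        # toy 4D input
Train_Y = np.array([0] * 8 + [1] * 3 + [2] * 1)   # imbalanced labels

X_res, y_res = resample_nd(Train_X, Train_Y, DuplicateMinority())
print(X_res.shape, np.bincount(y_res))
```

With imblearn installed, the same helper should accept `SMOTE(sampling_strategy='not majority', random_state=42)` in place of the stand-in, provided every class has more samples than `k_neighbors`.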