scikit-learn: machine learning in Python. Please feel free to ask specific questions about scikit-learn. Please try to keep the discussion focused on scikit-learn usage and immediately related open source projects from the Python ecosystem.
Are Cohen's kappa score and balanced accuracy score supposed to work with multiclass labels?
I have a 3-class classification problem and I'm trying to use cross_validate, but it returns NaNs for all my scores. I tested the problem by running cross_val_score on each score individually and isolated it to those 2 metrics.
X.shape = (100, 5)
y.shape = (100, 3)
clf is a RandomForestClassifier
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X, y, cv=5, scoring='balanced_accuracy')
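A minimal sketch of one possible cause, assuming the (100, 3) target above is a one-hot (indicator) encoding of the 3 classes, which the original message does not confirm. Both metrics are built on a confusion matrix and expect a 1-D vector of class labels, so collapsing the indicator matrix with argmax may be enough. The random data and the variable names y_onehot and y_1d are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))

# Assumed situation: the 3 classes are stored as a (100, 3) one-hot matrix.
labels = rng.randint(low=0, high=3, size=100)
y_onehot = np.eye(3, dtype=int)[labels]

# Collapse the indicator matrix back to a 1-D vector of class ids (0, 1, 2).
y_1d = y_onehot.argmax(axis=1)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(clf, X, y_1d, cv=5, scoring="balanced_accuracy")
print(scores)
```

With a 1-D multiclass target, each of the 5 folds should return a finite score instead of NaN.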
@razou could you please paste a fully reproducible piece of code?
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multioutput import ClassifierChain
from lightgbm import LGBMClassifier
base_estimator = LGBMClassifier()
calibrator = CalibratedClassifierCV(base_estimator=base_estimator)
clf = ClassifierChain(base_estimator=calibrator, order='random', random_state=20)
clf.fit(X=train_x, Y=train_y)
y_pred_proba = clf.predict_proba(validation_x)
train_x and train_y, which is probably the core of the problem. Using minimal random data from np.random.normal(size=(n_samples, n_features)) or np.random.randint(low=0, high=10, size=n_samples)
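For illustration, stand-in data of the kind suggested above could be generated as follows. This is only a sketch: the sizes, the seed, and the use of a seeded RandomState are arbitrary choices, not taken from the original code.

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded so that others get the same data
n_samples, n_features = 100, 5  # arbitrary sizes chosen for the example

# Continuous features drawn from a standard normal distribution.
train_x = rng.normal(size=(n_samples, n_features))

# Integer class labels in [0, 10), one per sample.
train_y = rng.randint(low=0, high=10, size=n_samples)

print(train_x.shape, train_y.shape)
```

A multilabel estimator such as ClassifierChain needs a 2-D target instead, for example rng.randint(low=0, high=2, size=(n_samples, n_labels)).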
Thank you guys for your answers
libraries
pip install lightgbm==3.2.1
pip install scikit-learn==0.22.2.post1
Code snippet
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multioutput import ClassifierChain
from lightgbm import LGBMClassifier
X, y = make_multilabel_classification(n_samples=2000, n_classes=10, n_labels=2, allow_unlabeled=True)
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=0.25)
mlb = MultiLabelBinarizer()
train_y_encoded = mlb.fit_transform(train_y)
validation_y_encoded = mlb.transform(validation_y)
base_estimator = LGBMClassifier()
calibrator = CalibratedClassifierCV(base_estimator=base_estimator)
clf = ClassifierChain(base_estimator=calibrator, order='random', random_state=20)
clf.fit(X=train_x, Y=train_y_encoded)
y_pred_proba = clf.predict_proba(validation_x)
print(y_pred_proba[:3])
I don't understand why you are using MultiLabelBinarizer here because y is already a binary representation of the target variable since in this snippet you used make_multilabel_classification. Please provide a snippet that causes the same error message as the problem you observe with cross-validation cohen kappa score.
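A small sketch of the point about MultiLabelBinarizer, using a reduced sample count and a fixed random_state that are not in the original snippet. make_multilabel_classification already returns a 0/1 indicator matrix, while MultiLabelBinarizer is meant for an iterable of label sets (label_sets below is a made-up example).

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.preprocessing import MultiLabelBinarizer

X, y = make_multilabel_classification(
    n_samples=20, n_classes=10, n_labels=2, allow_unlabeled=True, random_state=0
)
# y is already a binary indicator matrix with one column per label.
print(y.shape, np.unique(y))

# MultiLabelBinarizer expects sets of labels, one set per sample.
label_sets = [(0, 2), (1,), ()]
encoded = MultiLabelBinarizer().fit_transform(label_sets)
print(encoded)

# Applied to y, each row is read as a set drawn from the values {0, 1},
# so the 10 label columns collapse to at most 2 columns.
collapsed = MultiLabelBinarizer().fit_transform(y)
print(collapsed.shape)
```

Passing y straight to fit, without the binarizer, keeps all 10 label columns.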
Anyway, reading the scikit-learn documentation https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-s-kappa, I don't see how this would work for binary-encoded multilabel data.
>>> from sklearn.metrics import cohen_kappa_score
>>> cohen_kappa_score([[0, 1], [1, 1]], [[0, 0], [1, 0]])
Traceback (most recent call last):
File "<ipython-input-19-2a87559cbf88>", line 1, in <module>
cohen_kappa_score([[0, 1], [1, 1]], [[0, 0], [1, 0]])
File "/Users/ogrisel/code/scikit-learn/sklearn/metrics/_classification.py", line 639, in cohen_kappa_score
confusion = confusion_matrix(y1, y2, labels=labels, sample_weight=sample_weight)
File "/Users/ogrisel/code/scikit-learn/sklearn/metrics/_classification.py", line 304, in confusion_matrix
raise ValueError("%s is not supported" % y_type)
ValueError: multilabel-indicator is not supported
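One possible workaround, sketched here as an assumption and not as something proposed in the discussion, is to score each label column separately, since every column is an ordinary binary target. The arrays y_true and y_pred are made-up toy data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y_pred = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])

# Passing the 2-D indicator matrices directly raises, as in the traceback.
message = None
try:
    cohen_kappa_score(y_true, y_pred)
except ValueError as exc:
    message = str(exc)
print(message)

# Each column is a plain binary problem, so it can be scored on its own.
per_label = [
    cohen_kappa_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])
]
print(per_label)
```

Whether per-label scores, or an average of them, is a meaningful summary depends on the application.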
clf.fit(X=train_x, Y=train_y) instead of clf.fit(X=train_x, Y=train_y_encoded).
pos_label=0
@glemaitre Solved the problem by turning class 0 into 1.
But it's still not clear what kind of data I get by setting label=0.
Updated the code and added two videos with label=0 and label=1.
I put the code and videos here
It is quite possible that I am difficult to understand, since English is not my native language and I have no opportunity to practice it.
If we run this command after setup: pytest maint_tools/test_docstrings.py -k sklearn.utils.extmath.cartesian, we get
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /workspaces/scikit-learn, configfile: setup.cfg
plugins: cov-3.0.0
collected 0 items / 2 skipped
=================================================== short test summary info ===================================================
SKIPPED [2] maint_tools/test_docstrings.py:12: could not import 'numpydoc.validate': No module named 'numpydoc'
===================================================== 2 skipped in 0.47s ======================================================
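The summary above says both tests were skipped because numpydoc could not be imported. A quick way to check the environment is sketched below; numpydoc_available is a hypothetical helper name, and installing the package with pip is the presumed fix, not something confirmed in this thread.

```python
import importlib.util


def numpydoc_available() -> bool:
    """Return True if the numpydoc package can be found in this environment."""
    return importlib.util.find_spec("numpydoc") is not None


if numpydoc_available():
    print("numpydoc is importable; the docstring tests should not be skipped for this reason")
else:
    print("numpydoc is missing; try: pip install numpydoc")
```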