scikit-learn: machine learning in Python. Please feel free to ask specific questions about scikit-learn. Please try to keep the discussion focused on scikit-learn usage and immediately related open source projects from the Python ecosystem.
I have a general question: If for my dataset a kneighbor classifier works well (compared to e.g. SVC and Random Forest), are there other classifiers that might also work equally well?
I think it will depend on the data set. It also depends on how you are pre-processing your data. So kinda hard to say without knowing more.
Also, for tree-based models it's better to use OrdinalEncoder for categorical features instead.
I'm not sure that's true: using OE will make the trees treat categories as ordered values, but they're not. Native categorical support (as in LightGBM) properly treats categories as unordered and can yield the same splits with less tree depth.
OrdinalEncoder is probably the pragmatic solution. OneHotEncoder is only efficient if you use sparse output, which is currently not supported by ONNX as far as I know.
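To make the size difference concrete, here is a minimal sketch on made-up toy data (the array `X` is purely illustrative). It relies only on OneHotEncoder returning a SciPy sparse matrix by default, with one column per category, while OrdinalEncoder returns a dense array with a single column:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with three distinct categories.
X = np.array([["a"], ["b"], ["c"], ["a"]], dtype=object)

# Default OneHotEncoder output is a SciPy sparse matrix: one column per category.
onehot = OneHotEncoder().fit_transform(X)

# OrdinalEncoder always produces a dense array with one column per input feature.
ordinal = OrdinalEncoder().fit_transform(X)

is_sparse = sparse.issparse(onehot)
print(is_sparse, onehot.shape, ordinal.shape)
```

With many high-cardinality columns the one-hot width grows with the number of categories, which is why a dense one-hot output gets expensive quickly.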
@citron also you said "Pipeline = StandardScaler + LabelEncoder + LightGBM." but I assume you use a column transformer so that only the numerical features are scaled and the categorical features are encoded separately: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
BTW, standard-scaling the numerical features is often useless for tree-based models in general, and even more so for implementations such as LightGBM that bin the features.
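A quick way to check this yourself, sketched with a plain scikit-learn DecisionTreeClassifier on the iris dataset (both are stand-ins chosen for illustration, not the LightGBM setup discussed above). Tree splits depend only on the ordering of feature values, so I would expect the scaled and unscaled models to make the same predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and seed; the only difference is the feature scaling.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare predictions of the two trees on their respective inputs.
same = np.array_equal(raw_tree.predict(X), scaled_tree.predict(X_scaled))
print("identical predictions:", same)
```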
For the categorical columns, try to use OrdinalEncoder. In 0.24+ we have better support for unknown categories at test time:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
Although I am not sure that sklearn-onnx has replicated that feature yet.
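For reference, a minimal sketch of what that could look like. The column names and toy frames are made up for illustration; `handle_unknown="use_encoded_value"` and `unknown_value` are the OrdinalEncoder parameters added in 0.24:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one categorical and one numerical column.
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.0, 2.5, 3.0]})
test = pd.DataFrame({"color": ["blue", "green"], "size": [0.5, 4.0]})

preprocessing = ColumnTransformer([
    # Categories unseen during fit are mapped to -1 instead of raising an error.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["color"]),
    # Numerical features are passed through unscaled, which is fine for tree-based models.
    ("num", "passthrough", ["size"]),
])

preprocessing.fit(train)
encoded_test = preprocessing.transform(test)
print(encoded_test)
```

Here "green" never appears in the training frame, so it gets the code -1, while "blue" keeps the code it was assigned during fit.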
Hello,
I'm trying to use SimpleImputer(strategy='most_frequent') in a Pipeline on a dataframe with ~1.5M samples, but it takes a lot of time.
Is this normal? If so, are there alternatives to solve this issue?
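from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

def vectorizer_df(input_data, categorical_cols, numerical_cols):
    # Categorical columns: fill missing values with the most frequent category.
    categorical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent'))
    ])
    # Numerical columns: median imputation, then 10 equal-width ordinal bins.
    numerical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('bucketizer', KBinsDiscretizer(n_bins=10, strategy='uniform', encode='ordinal'))
    ])
    preprocessing = ColumnTransformer([
        ('cat', categorical_pipe, categorical_cols),
        ('num', numerical_pipe, numerical_cols)
    ])
    vectorizer_pipeline = Pipeline([
        ('vectorize', preprocessing)
    ])
    return vectorizer_pipeline.fit_transform(input_data)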
Thanks
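One possible workaround, assuming the slowdown really comes from the most_frequent imputation on the categorical columns, is to compute the modes with pandas and fill the missing values before the data reaches the pipeline. This is only a sketch: `impute_most_frequent` is a hypothetical helper, not a scikit-learn function, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def impute_most_frequent(df, categorical_cols):
    """Fill missing values in categorical_cols with each column's mode.

    Hypothetical helper, not part of scikit-learn. Returns the filled copy
    and the modes, so the same values can be reused on test data.
    """
    # DataFrame.mode() ignores NaN by default; row 0 holds the first mode per column.
    modes = df[categorical_cols].mode(dropna=True).iloc[0]
    out = df.copy()
    out[categorical_cols] = out[categorical_cols].fillna(modes)
    return out, modes

# Made-up example frame with one missing category.
df = pd.DataFrame({"city": ["a", None, "a", "b"], "n": [1.0, 2.0, np.nan, 4.0]})
imputed, modes = impute_most_frequent(df, ["city"])
print(imputed["city"].tolist())
```

The modes should be learned on the training data only and then applied to the test data with `fillna(modes)`, to mirror the fit/transform split that SimpleImputer gives you inside a Pipeline.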