I have a general question: If for my dataset a kneighbor classifier works well (compared to e.g. SVC and Random Forest), are there other classifiers that might also work equally well?
I think it will depend on the data set. It also depends on how you are pre-processing your data. So kinda hard to say without knowing more.
Also, for tree-based models it's better to use OrdinalEncoder instead of OneHotEncoder for categorical features
I'm not sure that's true: using OrdinalEncoder will make the trees treat categories as ordered values, but they're not. Native categorical support (as in LightGBM) properly treats categories as unordered and can yield the same splits with less tree depth
OrdinalEncoder is probably the pragmatic solution.
OneHotEncoder is only efficient if you use sparse output, which is currently not supported by ONNX as far as I know.
@citron also you said "Pipeline = StandardScaler + LabelEncoder + LightGBM." but I assume you use a column transformer to separate to only scale the numerical features and encode the categorical feature separately: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
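A minimal sketch of that ColumnTransformer setup (the column names and toy data here are made up for illustration; swap in your own numerical and categorical columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

# Toy heterogeneous dataframe: two numerical columns, one categorical.
df = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [40_000, 60_000, 80_000],
    'city': ['Paris', 'Lyon', 'Paris'],
})

# Scale only the numerical features, encode only the categorical one.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OrdinalEncoder(), ['city']),
])
X = preprocessor.fit_transform(df)
print(X.shape)  # (3, 3)
```

The LightGBM estimator would then go as the final step of a Pipeline wrapping this preprocessor.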
BTW, StandardScaling the numerical features is often useless for tree-based models in general, and even more so for implementations such as LightGBM that bin the features.
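A quick sanity check of that claim (a sketch on synthetic data, not a proof): a decision tree's predictions are invariant to an affine rescaling of the features such as StandardScaler, because splits depend only on feature ordering.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)  # label depends only on the first feature

# Fit one tree on raw features, one on standardized features.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
scaler = StandardScaler().fit(X)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

X_test = rng.rand(50, 3)
same = (tree_raw.predict(X_test) ==
        tree_scaled.predict(scaler.transform(X_test))).all()
print(same)  # True: scaling did not change the predictions
```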
For the categorical columns, try to use OrdinalEncoder. In 0.24+ we have better support for unknown categories at test time:
Although I am not sure that sklearn-onnx has replicated that feature yet.
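For reference, the 0.24+ unknown-category handling mentioned above looks like this (small illustrative example):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Map categories unseen at fit time to a sentinel value instead of raising.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(np.array([['red'], ['green'], ['blue']]))

# 'purple' was never seen during fit, so it becomes -1.
print(enc.transform(np.array([['green'], ['purple']])))
# [[ 1.]
#  [-1.]]
```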
I'm trying to use
SimpleImputer(strategy='most_frequent') in a Pipeline on a dataframe with ~1.5 M samples, but it takes a lot of time.
Is that normal? If so, are there alternatives to solve this issue?
```python
def vectorizer_df(input_data, categorical_cols, numerical_cols):
    categorical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent'))
    ])
    numerical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('bucketizer', KBinsDiscretizer(n_bins=10, strategy='uniform', encode='ordinal'))  # ordinal
    ])
    preprocessing = ColumnTransformer([
        ('cat', categorical_pipe, categorical_cols),
        ('num', numerical_pipe, numerical_cols)
    ])
    vectorizer_pipeline = Pipeline([
        ('vectorize', preprocessing)
    ])
    return vectorizer_pipeline.fit_transform(input_data)
```
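`SimpleImputer(strategy='most_frequent')` computes the mode column by column and can be slow on large object-dtype frames. One pragmatic workaround (an assumption, not a confirmed fix for your case; the toy columns below are illustrative) is to precompute the per-column modes with pandas and fill directly:

```python
import pandas as pd

# Toy frame with missing values in both categorical columns.
df = pd.DataFrame({'city': ['Paris', None, 'Paris', 'Lyon'],
                   'size': ['S', 'M', None, 'M']})

# df.mode() returns the most frequent value(s) per column; take the first row.
modes = df.mode().iloc[0]
filled = df.fillna(modes)  # fill each column with its own mode
print(filled['city'].tolist())  # ['Paris', 'Paris', 'Paris', 'Lyon']
```

This happens outside the Pipeline, though, so the imputation is no longer learned/applied by `fit`/`transform` on new data.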