lesshaste
@lesshaste
@adrinjalali yes. I was just wondering if anyone thought it was interesting.
rohanishervin
@rohanishervin
Why does ColumnTransformer convert the datatype to object after calling fit_transform?
3 replies
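[Editor's note] A minimal sketch (with made-up data) of what rohanishervin is likely seeing: ColumnTransformer horizontally stacks the transformed blocks into one array, and NumPy promotes a mix of string and float columns to their only common dtype, object.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25.0, 32.0, 47.0], "city": ["a", "b", "a"]})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),   # floats
    ("cat", "passthrough", ["city"]),     # strings pass through unchanged
])
out = ct.fit_transform(df)
print(out.dtype)  # object: floats and strings share no narrower common dtype
```

Encoding the string columns numerically (e.g. with OrdinalEncoder or OneHotEncoder) before stacking keeps the output a float array.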
benny
@benny:michael-enders.com
[m]
I have a general question: If for my dataset a kneighbor classifier works well (compared to e.g. SVC and Random Forest), are there other classifiers that might also work equally well?
1 reply
Guillaume Lemaitre
@glemaitre
@benny Stuff based on distances then
benny
@benny:michael-enders.com
[m]
can you give some examples?
Sharyar Memon
@sharyar
Hi everyone, I am not certain if this is the right place to ask. I am a first-time contributor. I love the library and it has helped me immensely in my studies so far. I was hoping to work on this issue as my first issue: scikit-learn/scikit-learn#18338
As far as I can understand, this issue requires that the documentation be updated, does that indicate the docstring within the function definition only, or is that referring to another piece of documentation?
One of the commentators on the issue also mentions ensuring there are tests that break if this documentation doesn't exist, how do I go about doing that effectively?
Sharyar Memon
@sharyar

I have a general question: If for my dataset a kneighbor classifier works well (compared to e.g. SVC and Random Forest), are there other classifiers that might also work equally well?

I think it will depend on the data set. It also depends on how you are pre-processing your data. So kinda hard to say without knowing more.

lesshaste
@lesshaste
when I apply OrdinalEncoder to my matrix X, how can I make the mapping the same for each column?
currently it is different if there is one new value in a column that doesn't occur in another column
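[Editor's note] One way to get the behaviour lesshaste asks for, sketched with invented data: pass the same explicit vocabulary for every column via OrdinalEncoder's `categories` parameter, instead of letting each column infer its own sorted category list.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["low", "high"],
              ["medium", "low"],
              ["high", "high"]])

# One shared vocabulary, repeated once per column -> identical mapping everywhere.
shared = ["low", "medium", "high"]
enc = OrdinalEncoder(categories=[shared, shared])
print(enc.fit_transform(X))
# low -> 0, medium -> 1, high -> 2 in BOTH columns
```

Without `categories`, each column would be encoded against its own observed values, so a value present in only one column shifts that column's mapping.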
William Gacquer
@citron
Hello Happy scikit-learners !
I need some help please.
I want to serve an onnx model.
Input = 144 columns ( medical records, some categoricals, some not ).
Output = classification.
Pipeline = StandardScaler + LabelEncoder + LightGBM.
I am stuck with the LabelEncoder. Is there an example of such a configuration somewhere? Google was not my friend.
I was able to produce an onnx model when bypassing the LabelEncoder... but I need it, and I want to avoid one-hot encoding because LightGBM performs much better without it.
Anyone?
rthy
@rthy:matrix.org
[m]
@citron You probably want OneHotEncoder not the LabelEncoder
Also, for tree-based models it's better to use OrdinalEncoder for categorical features
Nicolas Hug
@NicolasHug

Also, for tree-based models it's better to use OrdinalEncoder for categorical features

I'm not sure that's true: using OrdinalEncoder will make the trees treat categories as ordered values, but they're not. Native categorical support (as in LightGBM) properly treats categories as un-ordered and can yield the same splits with less tree depth

rthy
@rthy:matrix.org
[m]
Yes, you are right. I guess I'm too used to scikit-learn tree-based models not having native categorical support :)
Olivier Grisel
@ogrisel
I agree with @NicolasHug in theory, but in practice the difference with OrdinalEncoder (with tuned hyperparams) is typically negligible ;)
@citron Using OrdinalEncoder is probably the pragmatic solution. OneHotEncoder is only efficient if you use sparse output, which is currently not supported by ONNX as far as I know.
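[Editor's note] A short sketch of the 0.24+ unknown-category handling referred to below: with `handle_unknown="use_encoded_value"`, categories unseen at fit time are mapped to a sentinel instead of raising an error. The data here is invented.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(np.array([["cat"], ["dog"]]))

# "fish" was never seen during fit -> encoded as the sentinel -1, no exception
print(enc.transform(np.array([["dog"], ["fish"]])))
```

In earlier versions, an unknown category at transform time would raise a ValueError, which is awkward for a deployed pipeline.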
Xavier Dupré
@xadupre
@citron: what's the issue with LabelEncoder and ONNX? (I'm the main author of sklearn-onnx).
Olivier Grisel
@ogrisel

@citron also you said "Pipeline = StandardScaler + LabelEncoder + LightGBM", but I assume you use a ColumnTransformer so as to only scale the numerical features and encode the categorical features separately: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data

BTW, StandardScaling the numerical features is often useless for tree-based models in general, and even more so for implementations such as LightGBM that bin the features.

William Gacquer
@citron
Bonjour @xadupre, @ogrisel, @rthy:matrix.org, @NicolasHug. Yes, I do use a ColumnTransformer. Maybe I should better express my needs. The training set is made of 300000 rows. Column types are either floating point, integer (and sadly Pandas does not provide the R DataFrame handling of N/A), boolean, category, or list of categories. For instance, some category columns may have 2 or 10 numerical categories, some only have "string" categories, and some have a list of medications or a list of pathologies.
I have tried plenty of frameworks and among them, lightGBM was the best. Now, as I need to export the model and the pipeline in ONNX/ONNX-ML format, I need to wrap lightGBM in something to keep the pipeline around.
Olivier Grisel
@ogrisel
pandas 1.0 and later has support for explicit missing values in integer columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
scikit-learn however will convert this to a float anyway (but no big deal).
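[Editor's note] A small illustration of the nullable integer dtype linked above: pandas' "Int64" (capital I) keeps missing values as pd.NA inside an integer column, whereas the plain int64 dtype would force a cast to float.

```python
import pandas as pd

# The capital-I "Int64" extension dtype supports missing values natively.
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)         # Int64
print(s.isna().sum())  # 1 -- the missing value stays missing, no float cast
```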
William Gacquer
@citron
@ogrisel Yes, no problem with pure int columns.
Olivier Grisel
@ogrisel

For the categorical columns, try to use OrdinalEncoder. In 0.24+ we have better support for unknown categories at test time:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

Although I am not sure that sklearn-onnx has replicated that feature yet.

William Gacquer
@citron
@ogrisel maybe @xadupre knows ?
Olivier Grisel
@ogrisel
If you have specific problems exporting a pipeline with OrdinalEncoder to onnx, better report the exact error message with a simple reproduction case to https://github.com/onnx/sklearn-onnx
Xavier Dupré
@xadupre
I wrote this example about converting a pipeline including a lightgbm model in a scikit-learn pipeline: http://onnx.ai/sklearn-onnx/auto_tutorial/plot_gexternal_lightgbm.html.
William Gacquer
@citron
@xadupre Thanks! The binder link at the end of the page has a problem.
In fact, that's the example I started with. Works fine without labelEncoder.
William Gacquer
@citron
I forgot to mention an important thing: I use FLAML to select the best hyperparameters and thus the best model.
Xavier Dupré
@xadupre
I'll investigate the issue with LabelEncoder then. What is the error you get?
Loïc Estève
@lesteve
I think it would be a good idea to encourage creating a GitHub Discussion (rather than gitter) for anything other than simple questions/answers: https://github.com/scikit-learn/scikit-learn/discussions/new. Gitter is not properly indexed by search engines, so it is not a great use of time for people who answer questions.
I agree that "simple question/answer" does not have a very well-defined boundary, but in the case of @citron's questions I think we crossed this boundary a long time ago ...
William Gacquer
@citron
@lesteve I understand and agree.
Loïc Estève
@lesteve
@citron then if you find the time maybe create a Github Discussion and post the link in the gitter so that the discussion can continue in the Github Discussion?
SmellySalami
@SmellySalami
Hello guys! My friends and I are looking to tackle some open issues on scikit-learn soon. We're very new, so I would love a high-level overview of the architecture.
Can anyone help or point to some resources?
rthy
@rthy:matrix.org
[m]
Have a look at https://scikit-learn.org/stable/developers/contributing.html for a getting-started guide.
William Gacquer
@citron
Hello (I am back, and I will try not to flood your screen)
William Gacquer
@citron
Using Scikit-learn 0.24.1 and sklearn-onnx 1.7.0, I try to export a pipeline embedding an HistGradientBoostingClassifier. The data contains only StandardScaled floating point features.
convert_sklearn raises an error: 'numpy.bool' object has no attribute 'encode' (StringTensorType).
Any ideas, please?
2 replies
razou
@razou

Hello,
I'm trying to use SimpleImputer(strategy='most_frequent') in a Pipeline on a dataframe with ~1.5M samples, but it takes a lot of time.
Is that normal? If so, are there alternatives to solve this issue?

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

def vectorizer_df(input_data, categorical_cols, numerical_cols):

    categorical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent'))
    ])

    numerical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('bucketizer', KBinsDiscretizer(n_bins=10, strategy='uniform', encode='ordinal'))  # ordinal
    ])

    preprocessing = ColumnTransformer(
        [('cat', categorical_pipe, categorical_cols),
         ('num', numerical_pipe, numerical_cols)
         ])

    vectorizer_pipeline = Pipeline([
        ('vectorize', preprocessing)
    ])

    return vectorizer_pipeline.fit_transform(input_data)

Thanks

4 replies
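[Editor's note] A hedged workaround sketch for the slow most_frequent imputation on older scikit-learn versions (besides upgrading, which the replies recommend): pre-fill the categorical columns with each column's mode in pandas, which is vectorized, before handing the frame to the pipeline. The column names below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", None, "red", "blue"],
                   "size": ["S", "M", None, "M"]})

modes = df.mode().iloc[0]   # most frequent value per column (NaN ignored)
filled = df.fillna(modes)   # fill each column with its own mode
print(filled["color"].tolist())  # ['red', 'red', 'red', 'blue']
```

Note this computes the mode on the full frame rather than per CV fold, so unlike SimpleImputer inside a Pipeline it leaks a little information across folds; it is a pragmatic stopgap, not a drop-in replacement.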
Guillaume Lemaitre
@glemaitre
@razou which version of scikit-learn are you using?
We merged the following improvement in 0.24 -> scikit-learn/scikit-learn#18987
that makes it efficient to work with strings, which was not really possible before because it was too slow
razou
@razou
Thanks @glemaitre I'm using
scikit-learn==0.22.2.post1
sklearn-crfsuite==0.3.6
Guillaume Lemaitre
@glemaitre
yep so this should be the reason. You can update to 0.24 via conda-forge or PyPI and it should work better
razou
@razou
Thanks @glemaitre for your answers (y)
benny
@benny:michael-enders.com
[m]
Hello, can you tell me why GridSearchCV's .best_score_ is worse than when I evaluate the same dataset with .score()?
Shouldn't those be the same?
benny
@benny:michael-enders.com
[m]
or is it because .best_score_ is only evaluated on the held-out cross-validation splits?
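[Editor's note] benny's second guess is right, and a toy sketch (invented data) makes the difference visible: best_score_ is the mean score over the held-out CV folds for the best parameter setting, while .score(X, y) on the refit estimator evaluates on whatever data you pass in, so scoring on the training data itself typically looks better.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
gs = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5]}, cv=5)
gs.fit(X, y)

print(gs.best_score_)  # mean accuracy over held-out CV folds for best params
print(gs.score(X, y))  # accuracy of the refit model on the data passed in
```

The two numbers answer different questions: best_score_ estimates generalization via cross-validation; .score() on the training set measures fit to data the model has already seen.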