Ghost
@ghost~5bc98094d73408ce4fabf741
I would assume that if I do cluster analysis on the data and get word frequencies per cluster I should be able to see more or less, but that doesn't give me a full picture
For imbalanced data I would worry about evaluation first
Rahul Nair
@rahulunair
@piotr-mamenas yeah, as long as you are randomly sampling, you can either upsample the smaller class or downsample the larger one, or use an algorithm that can tolerate unbalanced data. You could even turn the problem into an anomaly detection one if the smaller class has only very few data points. There are other techniques like SMOTE that you could look into as well.
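A minimal sketch of the random upsampling idea mentioned above, assuming a toy dense dataset with a 90/10 class split (the data and variable names are made up for illustration). It uses `sklearn.utils.resample` to draw minority samples with replacement until both classes have the same size.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset (hypothetical): 90 majority samples, 10 minority samples.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Upsample the minority class with replacement to match the majority size.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))
```

Resampling like this would normally be applied to the training split only, so that the evaluation data keeps the original class distribution.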
Ghost
@ghost~5bc98094d73408ce4fabf741
Thanks for the answers @amueller and @rahulunair. I wasn't aware there are models for working with imbalanced data sets specifically. How do they fare against large sparse matrices? I am talking about word vectors.
Rahul Nair
@rahulunair
@piotr-mamenas I would consider the word vectors as the input embeddings. If you are looking to classify something, you can look into SVMs that deal with unbalanced classes; essentially they weight the classes differently. scikit-learn has a section for that: https://scikit-learn.org/stable/modules/svm.html
or try a tree-based algorithm to see how your accuracy numbers and ROC curve look
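A rough sketch of both suggestions, assuming a synthetic dataset as a stand-in for real word vectors (the data is generated, then stored as a SciPy CSR matrix only to show that sparse input is accepted). `class_weight="balanced"` reweights classes inversely to their frequency, and ROC AUC is used because plain accuracy can look good on imbalanced data even for a useless model.

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for text features: about 95% / 5% class split.
X, y = make_classification(
    n_samples=2000, n_features=50, weights=[0.95, 0.05], random_state=0
)
X = sparse.csr_matrix(X)  # both estimators below accept sparse input

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Linear SVM with class weights inversely proportional to class frequencies.
svm = LinearSVC(class_weight="balanced", max_iter=5000).fit(X_train, y_train)
svm_auc = roc_auc_score(y_test, svm.decision_function(X_test))

# Tree-based alternative with the same class weighting.
forest = RandomForestClassifier(class_weight="balanced", random_state=0)
forest.fit(X_train, y_train)
forest_auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

print(f"LinearSVC ROC AUC: {svm_auc:.3f}")
print(f"RandomForest ROC AUC: {forest_auc:.3f}")
```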
Ghost
@ghost~5bc98094d73408ce4fabf741
Thanks @rahulunair your response is much appreciated, I will take a look at it.
lesshaste
@lesshaste
if you are doing image classification using scikit-learn, do you have to convert the images into 1d arrays first?
Guillaume Lemaitre
@glemaitre
yes
lesshaste
@lesshaste
@glemaitre hmm... that seems to lose some vital information
i.e. that pixels next to each other are related
Guillaume Lemaitre
@glemaitre
you can look at an example using load_digits to see that the 8x8 images are transformed into 1d arrays of 64 values
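For reference, a short sketch of what that flattening looks like with `load_digits`: the dataset exposes both the 2d images and the already flattened feature matrix, and reshaping the former reproduces the latter.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8): one 8x8 image per sample
print(digits.data.shape)    # (1797, 64): the same pixels, one row per sample

# Flatten each 8x8 image into a 1d array of 64 values.
X_flat = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(X_flat, digits.data))  # True
```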
Olivier Grisel
@ogrisel
you can use a pre-trained convolutional neural network to extract interesting features
or you can use scikit-image HoG features for instance. Depending on the kinds of images, it might be enough.
Guillaume Lemaitre
@glemaitre

you can use a pre-trained convolutional neural network to extract interesting features

which is a much better approach

lesshaste
@lesshaste
HoG features?
Histogram of Oriented Gradients ?
Olivier Grisel
@ogrisel
Histogram of Oriented Gradients
yes
lesshaste
@lesshaste
all very interesting thanks. It seems a weakness somehow in the general non-NN classification model that it can't take advantage of 2d data
Guillaume Lemaitre
@glemaitre
I think that I have 2 quick examples showing a bit how things can be connected:
lesshaste
@lesshaste
I suppose even in 1d, random forests etc. are invariant to permutations of the input array
@glemaitre thanks
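A small experiment sketching the permutation point, assuming the digits dataset and a fixed, arbitrary column shuffle applied identically to the training and test data. The two scores are not expected to match exactly, because the shuffle also changes which features each tree happens to sample, but they should be close since the forest never uses pixel adjacency.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One fixed random reordering of the 64 pixel columns.
perm = np.random.RandomState(0).permutation(X.shape[1])


def fit_and_score(columns):
    """Train a forest on the given column order and return test accuracy."""
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train[:, columns], y_train)
    return forest.score(X_test[:, columns], y_test)


score_original = fit_and_score(np.arange(X.shape[1]))
score_shuffled = fit_and_score(perm)
print(f"original pixel order: {score_original:.3f}")
print(f"shuffled pixel order: {score_shuffled:.3f}")
```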
Olivier Grisel
@ogrisel
yes, you have to do feature engineering first. You can consider the 2D conv layers before the final flatten / global average pooling as a feature extractor and the last fully connected layers as a standard classifier. It's just that both the feature extraction and the classifier are trained end-to-end together
lesshaste
@lesshaste
it's only really the convolutions that take advantage of the neighborhood of pixels I suppose
@ogrisel right.
Olivier Grisel
@ogrisel
but nowadays, (convolutional) neural networks are almost always the right solution for image classification, unless you have very specific prior knowledge about the images you want to classify.
lesshaste
@lesshaste
I wonder if random forests could be changed to take arrays of pairs, say, as inputs
@ogrisel that's true but I am also thinking of time series data
where it makes a big difference if two values are from successive times or not
Olivier Grisel
@ogrisel

it's only really the convolutions that take advantage of the neighborhood of pixels I suppose

No: if you have deep conv layers with downsampling (strides or max pooling for instance) the conv layers can capture large high level complex patterns that span a large receptive field.

lesshaste
@lesshaste
@ogrisel you said No but I read your answer as yes :)
Olivier Grisel
@ogrisel
we need an example of some standard feature engineering you can do on time windows for time series forecasting / classification.
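One possible sketch of such window-based feature engineering, using NumPy only. The helper name `make_window_features` and the choice of summary statistics (mean, standard deviation, last-minus-first difference) are illustrative assumptions, not an established scikit-learn recipe. Each row holds the previous `window` values plus a few summaries, and the target is the next value.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20


def make_window_features(series, window):
    """Build lagged-window features and next-step targets (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    # Each row is `window` consecutive values; drop the last window,
    # which has no following value to predict.
    windows = sliding_window_view(series, window)[:-1]
    targets = series[window:]
    features = np.column_stack(
        [
            windows,                          # raw lagged values, oldest first
            windows.mean(axis=1),             # window mean
            windows.std(axis=1),              # window standard deviation
            windows[:, -1] - windows[:, 0],   # change across the window
        ]
    )
    return features, targets


series = np.arange(10.0)
X, y = make_window_features(series, window=3)
print(X.shape, y.shape)
```

The resulting `X` and `y` can be passed to any regressor or classifier; the ordering of values in time is encoded in the column positions rather than learned by the model.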
lesshaste
@lesshaste
@ogrisel that would be good to see
Olivier Grisel
@ogrisel
I misread the original quote, then yes.
But what I meant is that a deep conv net can model non-local patterns
lesshaste
@lesshaste
@ogrisel yes. What I meant is that without any convolutions you don't get to see local patterns
on an NN topic, is there software to give you a good guess at a reasonable architecture for a classification task? I saw autokeras but it's pretty heavy.
Olivier Grisel
@ogrisel
if you really want to use decision trees for image classification you might be interested in https://arxiv.org/abs/1905.10073 but this is not (and will not be) implemented in scikit-learn ;)
lesshaste
@lesshaste
@ogrisel thanks! Why won't it be implemented? Because it doesn't work or coding resources?
Olivier Grisel
@ogrisel
I don't know what the practical state of the art is for architecture search for image classification
lesshaste
@lesshaste
really I am secretly interested in time series
Olivier Grisel
@ogrisel
because it is not a standard, established method.
lesshaste
@lesshaste
@ogrisel got you
image classification was just interesting because the data is in 2d