Aditya Padwal
@adityap31
Thanks @NicolasHug
Emoruwa
@Emoruwa
Please, what's the best C# tutorial online?
Give me ideas
Andreas Mueller
@amueller
@Emoruwa since you're not the first one asking this here: what gave you the idea of asking about C# in a channel about a Python library for machine learning?
Manish Aradwad
@ManishAradwad
Hi, everyone. My name is Manish and it's nice to meet you all. I used scikit-learn for one of my projects this summer and I really love this library. I want to start contributing to it. I'm new to open source and I don't know how to get started. I checked issues under the good first issue label but I'm not able to understand anything. Can anyone please guide me with this?
Andreas Mueller
@amueller
@ManishAradwad welcome! the easiest way is probably to ask directly on the issue. Have you checked out the contributors guide?
Manish Aradwad
@ManishAradwad
Yes, I'm now going through the repo first. I'll then go for the issues. Thanks for the reply!
Andreas Mueller
@amueller
I wouldn't try going through the whole repo, it's a lot. I would start with the contributor docs
even understanding how we set up and run tests would probably take a week
lesshaste
@lesshaste
is there something in scikit-learn for 4000-dimensional regression where I know only one or two of the coefficients are non-zero?
lesshaste
@lesshaste
something like forward stepwise regression?
Andreas Mueller
@amueller
not yet. mlxtend has it and there's a PR
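For reference, a minimal sketch of what forward stepwise selection with mlxtend might look like; the data shapes and the exact SequentialFeatureSelector arguments here are illustrative assumptions, so check the mlxtend documentation:

# Forward stepwise feature selection with mlxtend -- a sketch, not a
# definitive recipe. Data shapes and parameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector

rng = np.random.RandomState(0)
X = rng.randn(100, 30)  # stand-in for a high-dimensional design matrix
y = 3.0 * X[:, 0] - 2.0 * X[:, 7] + rng.randn(100)  # only two non-zero coefficients

sfs = SequentialFeatureSelector(
    LinearRegression(),
    k_features=2,        # stop after selecting two features
    forward=True,        # add features greedily (forward stepwise)
    scoring='neg_mean_squared_error',
    cv=5,
)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)  # indices of the selected features

If the solution is known to be sparse, Lasso (already in scikit-learn) is another common choice for this kind of problem.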
lesshaste
@lesshaste
@amueller Thanks! I will take a look at mlxtend, which I didn't know about
Girraj Jangid
@Girrajjangid
Can anyone please provide a good source on how to deal with categorical data? It would be very helpful, thank you
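The preprocessing chapter of the scikit-learn user guide covers this; as a minimal sketch (the column names and data below are made up):

# Encoding a categorical column with OneHotEncoder -- a sketch with made-up data.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    'color': ['red', 'blue', 'red'],  # categorical feature
    'size': [1.0, 2.5, 3.0],          # numeric feature
})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['color'])],
    remainder='passthrough',  # leave the numeric column unchanged
)
print(ct.fit_transform(df))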
Manish Aradwad
@ManishAradwad
@amueller Hi!! As you said, I've gone through the contributing guides and set up the development environment. Can you please tell me what I should do next? Thanks for the help!!
Andreas Mueller
@amueller
@ManishAradwad look at things tagged as "good first issue" and "help wanted" as outlined in the contributing guide
Kristiyan Katsarov
@katsar0v

Hello guys, maybe anyone can help me out here. I am running following validation code:

from sklearn.model_selection import validation_curve

train_scores, valid_scores = validation_curve(estimator=pipeline,  # estimator (pipeline)
                                              X=features,  # features matrix
                                              y=target,  # target vector
                                              param_name='pca__n_components',
                                              param_range=range(1, 50),  # test these k-values
                                              cv=5,  # 5-fold cross-validation
                                              scoring='neg_mean_absolute_error')  # negative mean absolute error

in the same .py file on different machines, which I'll call #1 localhost, #2 staging, #3 live3, #4 live4

localhost and staging both have i7 CPUs; localhost needs around 40s for the validation, staging needs around 13-14 seconds

live3 (#3) and live4 (#4) need almost 10 minutes to execute the validation; both of these servers have Intel CPUs with 48 threads.

To get more "trustworthy" numbers I dockerized the code and ran the images on the servers. Does anyone have an idea why the speed is so different?

Andreas Mueller
@amueller
how many cores do you have in localhost and staging?
could be that you're overallocating processes in the estimator and parallelization actually hurts you
Kristiyan Katsarov
@katsar0v
@amueller localhost and staging are both with i7 (4 cores and 8 threads)
Andreas Mueller
@amueller
what's pipeline?
so the number of cores is the likely difference, right?
Kristiyan Katsarov
@katsar0v
yeah, live3 and live4 have 48 threads, 24 cores. Pipeline:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

poly_transformer = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression()
pipeline = Pipeline([('poly', poly_transformer), ('reg', model)])
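One thing worth noting: the validation_curve call above sweeps param_name='pca__n_components', which requires a pipeline step named 'pca', while this snippet only has 'poly' and 'reg' steps. A sketch of a pipeline that would match that parameter name, assuming PCA is the intended step:

# A pipeline whose 'pca' step matches param_name='pca__n_components' above.
# This is an assumption about the real setup, not the code actually used.
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('pca', PCA()),  # n_components is the parameter swept by validation_curve
    ('reg', LinearRegression()),
])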
Kristiyan Katsarov
@katsar0v

After profiling, I saw this (slowest time on bottom, sorted by 3rd column):

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4150  208.706    0.050  208.706    0.050 {built-in method numpy.dot}
      245   13.112    0.054   13.360    0.055 decomp_svd.py:16(svd)
     2170  142.567    0.066  143.360    0.066 decomp_lu.py:153(lu)

Just executed python -m cProfile validation.py

Andreas Mueller
@amueller
can you try to benchmark just calling svd directly, without any sklearn around it?
if that's a pure scipy issue, that would be good to isolate
Kristiyan Katsarov
@katsar0v
how can I isolate it, make a separate .py and run cProfile on it?
Andreas Mueller
@amueller
make a py file that calls scipy.linalg.svd without using sklearn
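Such an isolated benchmark could look like the sketch below; the matrix size is an arbitrary choice, just large enough to exercise the BLAS:

# Benchmark scipy.linalg.svd in isolation, without scikit-learn.
# The 2000x2000 size is an arbitrary choice for illustration.
import time

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
a = rng.randn(2000, 2000)

start = time.perf_counter()
U, s, Vh = linalg.svd(a)
print("svd took %.2fs" % (time.perf_counter() - start))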
Andreas Mueller
@amueller
lol I am killing the sorting of the pull requests in the issue tracker by adding tags. sorry lol
Kristiyan Katsarov
@katsar0v
I will try this and report back here. Any ideas what could be the reason? localhost and staging are Intel i7, live3 and live4 are Xeon CPUs. Do you think MKL would improve speed, or setting up the environment in another way? (TensorFlow recommends a custom compile for speed, for example)
Andreas Mueller
@amueller
how did you install numpy and scipy? if you did a custom compilation that might be a reason. if you installed binaries they will use MKL or OpenBLAS, either of which should be quite fast
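A quick way to check which BLAS the installed binaries are linked against:

# Inspect which BLAS/LAPACK libraries numpy and scipy were built against.
import numpy as np
import scipy

np.show_config()     # prints MKL/OpenBLAS build info for numpy
scipy.show_config()  # same for scipy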
Kristiyan Katsarov
@katsar0v
Using pipenv, numpy 1.16.x I think
They are using OpenBLAS
Andreas Mueller
@amueller
well that should work
Kristiyan Katsarov
@katsar0v

@amueller I don't know if this helps:
I ran

from scipy import linalg
import numpy as np
m, n = 9, 6
a = np.random.randn(m, n) + 1.j*np.random.randn(m, n)
U, s, Vh = linalg.svd(a)
print(U.shape,  s.shape, Vh.shape)

cProfile says:

      394    0.004    0.000    0.017    0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
      900    0.004    0.000    0.004    0.000 {built-in method posix.stat}
        1    0.006    0.006    0.006    0.006 lil.py:23(lil_matrix)
    81/24    0.007    0.000    0.011    0.000 sre_compile.py:64(_compile)
  402/399    0.011    0.000    0.022    0.000 {built-in method builtins.__build_class__}
    212/1    0.023    0.000    0.222    0.222 {built-in method builtins.exec}
      190    0.024    0.000    0.024    0.000 {built-in method marshal.loads}
    39/37    0.038    0.001    0.043    0.001 {built-in method _imp.create_dynamic}

(sorted by second column)

        9    0.000    0.000    0.000    0.000 __future__.py:79(__init__)
        9    0.000    0.000    0.000    0.000 _globals.py:77(__repr__)
        9    0.000    0.000    0.000    0.000 {method 'encode' of 'str' objects}
        9    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        9    0.000    0.000    0.000    0.000 os.py:742(encode)
        9    0.000    0.000    0.001    0.000 abc.py:151(register)
        9    0.000    0.000    0.001    0.000 datetime.py:356(__new__)
      900    0.001    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:75(_path_stat)
      900    0.004    0.000    0.004    0.000 {built-in method posix.stat}
      936    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:321(<genexpr>)
       96    0.000    0.000    0.000    0.000 enum.py:630(<lambda>)
    39/37    0.038    0.001    0.043    0.001 {built-in method _imp.create_dynamic}
        1    0.002    0.002    0.002    0.002 __init__.py:259(_reset_cache)
        1    0.006    0.006    0.006    0.006 lil.py:23(lil_matrix)

(sorted by third column)

this is on the 24-core machine
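One caveat about the run above: a 9x6 SVD finishes in microseconds, so the profile is dominated by import machinery rather than linear algebra. Timing a larger problem, as in this sketch (the size is again an arbitrary choice), isolates the BLAS cost:

# The 9x6 SVD above completes in microseconds, so cProfile mostly measures
# imports. Timing a larger matrix isolates the actual linear-algebra cost.
import timeit

import numpy as np
from scipy import linalg

a = np.random.randn(1500, 1500)
t = timeit.timeit(lambda: linalg.svd(a), number=3) / 3
print("mean svd time: %.2fs" % t)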
Kristiyan Katsarov
@katsar0v

@amueller when I run this code:

train_scores, valid_scores = validation_curve(estimator=pipeline,  # estimator (pipeline)
                                              X=features,  # features matrix
                                              y=target,  # target vector
                                              param_name='pca__n_components',
                                              param_range=range(1, 50),  # test these k-values
                                              cv=5,  # 5-fold cross-validation
                                              scoring='neg_mean_absolute_error')  # negative mean absolute error

directly on the host (with 24 cores) I get ~30 seconds. When I run it directly on localhost (4 cores, 8 threads) I get around 30-40 seconds as well. When I run it inside Docker with a CPU limit of 6 cores and 6GB RAM, it needs almost 10 minutes. Inside a VirtualBox VM with 2 cores, around 30 seconds. It seems scikit-learn does not play well with Docker CPU limits, which use the CFS scheduler: link

Also found out that if I adjust param_range to range(1, 5) the code runs much faster (I am no data scientist)
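One way to see the mismatch from inside the container is to compare Python's CPU count with the quota the CFS scheduler actually enforces; the cgroup v1 paths below are an assumption about the container setup:

# Compare what Python reports as the CPU count with the CFS quota the
# container enforces. The cgroup v1 paths are an assumption about the setup.
import os

print("os.cpu_count():", os.cpu_count())  # typically reports the host's cores

try:
    with open('/sys/fs/cgroup/cpu/cpu.cfs_quota_us') as f:
        quota = int(f.read())
    with open('/sys/fs/cgroup/cpu/cpu.cfs_period_us') as f:
        period = int(f.read())
    if quota > 0:
        print("effective CPUs under the CFS quota:", quota / period)
except FileNotFoundError:
    print("no cgroup v1 CPU quota found")

A worker pool sized from os.cpu_count() would then oversubscribe the throttled container, which is consistent with the slowdown described above.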
Kristiyan Katsarov
@katsar0v
It seems validation_curve does not really profit from multithreading/multiprocessing. I get almost the same results on an Intel i7 (4 cores) and an Intel Xeon (24 cores). The problem is that when the validation curve runs on the Xeon machines, it uses all cores and the machine is overloaded, which makes no sense, really :)
Kristiyan Katsarov
@katsar0v
cv=3 makes it faster as well
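validation_curve also takes an n_jobs argument, so the number of worker processes can be pinned explicitly instead of left to defaults; a sketch reusing the variables from the snippet above:

# Cap the cross-validation worker processes explicitly via n_jobs.
# pipeline, features and target are the variables from the snippet above.
from sklearn.model_selection import validation_curve

train_scores, valid_scores = validation_curve(
    estimator=pipeline,
    X=features,
    y=target,
    param_name='pca__n_components',
    param_range=range(1, 50),
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=4,  # at most four worker processes for the CV loop
)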
Samesh Lakhotia
@sameshl
How should I install the dependencies for local development of scikit-learn?
Andreas Mueller
@amueller
I'd recommend using conda and doing conda install numpy scipy cython matplotlib pytest flake8 sphinx sphinx-gallery or something like that
Kristiyan Katsarov
@katsar0v
@amueller by the way, numpy and scipy from conda somehow perform faster than from pip
but I still haven't found out why
Andreas Mueller
@amueller
@katsar0v that's possibly MKL vs OpenBLAS
but could also be how they are configured by default
i.e. how many threads they use etc
Kristiyan Katsarov
@katsar0v
@amueller how can I reconfigure numpy and scipy to use a maximum number of threads, e.g. 6?
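Two common routes, sketched below: setting the BLAS environment variables before numpy is imported, or using threadpoolctl (a separate package) to limit the thread pools at runtime:

# Two ways to cap BLAS threads at six -- a sketch.

# 1) Environment variables, which must be set before numpy/scipy are imported:
import os
os.environ['OMP_NUM_THREADS'] = '6'
os.environ['OPENBLAS_NUM_THREADS'] = '6'
os.environ['MKL_NUM_THREADS'] = '6'

import numpy as np

# 2) threadpoolctl (a separate package), which can limit pools at runtime:
from threadpoolctl import threadpool_limits

a = np.random.randn(1000, 1000)
with threadpool_limits(limits=6):
    np.dot(a, a)  # runs with at most six BLAS threads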