Andreas Mueller
@amueller
can you try to benchmark just calling svd directly without any sklearn around it?
if that's a pure scipy issue, that would be good to isolate
Kristiyan Katsarov
@katsar0v
how can I isolate it — make a separate .py and run cProfile on it?
Andreas Mueller
@amueller
make a py file that calls scipy.linalg.svd without using sklearn
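A minimal standalone version of that, with timing added (the matrix size here is an illustrative guess, chosen large enough that the SVD itself, rather than import overhead, dominates the runtime):

```python
import time

import numpy as np
from scipy import linalg

# Illustrative size; the chat's 9x6 example is far too small to benchmark.
rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 500))

start = time.perf_counter()
U, s, Vh = linalg.svd(a, full_matrices=False)
elapsed = time.perf_counter() - start

print(f"scipy.linalg.svd on {a.shape}: {elapsed:.3f} s")
```

Running this on each machine, outside of any sklearn code, should show whether the slowdown is in the BLAS/LAPACK layer itself.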
Andreas Mueller
@amueller
lol I am killing the sorting of the pull requests in the issue tracker by adding tags. sorry lol
Kristiyan Katsarov
@katsar0v
I will try this and report back here. Any ideas what the reason could be? Localhost and staging are Intel i7; live3 and live4 are Xeon CPUs. Do you think mkl would improve speed, or would setting up the environment another way help? (TensorFlow recommends a custom compile for speed, for example)
Andreas Mueller
@amueller
how did you install numpy and scipy? if you did custom compilation that might be a reason. if you install binaries they will use mkl or openblas, either of which should be quite fast
Kristiyan Katsarov
@katsar0v
Using pipenv, numpy 1.16.x I think
They are using openblas
Andreas Mueller
@amueller
well that should work
Kristiyan Katsarov
@katsar0v

@amueller I don't know if this helps:
I ran

from scipy import linalg
import numpy as np
m, n = 9, 6
a = np.random.randn(m, n) + 1.j*np.random.randn(m, n)
U, s, Vh = linalg.svd(a)
print(U.shape,  s.shape, Vh.shape)

cProfile says (columns are ncalls, tottime, percall, cumtime, percall, filename:lineno(function)):

      394    0.004    0.000    0.017    0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
      900    0.004    0.000    0.004    0.000 {built-in method posix.stat}
        1    0.006    0.006    0.006    0.006 lil.py:23(lil_matrix)
    81/24    0.007    0.000    0.011    0.000 sre_compile.py:64(_compile)
  402/399    0.011    0.000    0.022    0.000 {built-in method builtins.__build_class__}
    212/1    0.023    0.000    0.222    0.222 {built-in method builtins.exec}
      190    0.024    0.000    0.024    0.000 {built-in method marshal.loads}
    39/37    0.038    0.001    0.043    0.001 {built-in method _imp.create_dynamic}

(sorted by the second column, tottime)

        9    0.000    0.000    0.000    0.000 __future__.py:79(__init__)
        9    0.000    0.000    0.000    0.000 _globals.py:77(__repr__)
        9    0.000    0.000    0.000    0.000 {method 'encode' of 'str' objects}
        9    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        9    0.000    0.000    0.000    0.000 os.py:742(encode)
        9    0.000    0.000    0.001    0.000 abc.py:151(register)
        9    0.000    0.000    0.001    0.000 datetime.py:356(__new__)
      900    0.001    0.000    0.005    0.000 <frozen importlib._bootstrap_external>:75(_path_stat)
      900    0.004    0.000    0.004    0.000 {built-in method posix.stat}
      936    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:321(<genexpr>)
       96    0.000    0.000    0.000    0.000 enum.py:630(<lambda>)
    39/37    0.038    0.001    0.043    0.001 {built-in method _imp.create_dynamic}
        1    0.002    0.002    0.002    0.002 __init__.py:259(_reset_cache)
        1    0.006    0.006    0.006    0.006 lil.py:23(lil_matrix)

(sorted by third column)

this is on the 24-core machine
Kristiyan Katsarov
@katsar0v

@amueller when I run this code:

from sklearn.model_selection import validation_curve

train_scores, valid_scores = validation_curve(
    estimator=pipeline,                 # estimator (pipeline)
    X=features,                         # feature matrix
    y=target,                           # target vector
    param_name='pca__n_components',
    param_range=range(1, 50),           # test these component counts
    cv=5,                               # 5-fold cross-validation
    scoring='neg_mean_absolute_error')  # negated MAE, so higher is better

directly on the host (with 24 cores) I get ~30 seconds. When I run it directly on localhost (4 cores, 8 threads) I also get around 30-40 seconds. When I run it inside Docker with a CPU limit of 6 cores and 6 GB RAM, it needs almost 10 minutes. Inside a VirtualBox VM with 2 cores, around 30 seconds. It seems scikit-learn does not play well with Docker's CPU limits, which use the CFS scheduler: link

Also found out that if I adjust param_range to range(1, 5) the code runs much faster (I am no data scientist)
Kristiyan Katsarov
@katsar0v
It seems validation_curve does not really profit from multithreading/multiprocessing. I get almost the same results on an Intel i7 (4 cores) and an Intel Xeon (24 cores). The problem is that when the validation curve runs on the Xeon machines it uses all cores and the machine is overloaded, which makes no sense, really :)
Kristiyan Katsarov
@katsar0v
cv=3 makes it faster as well
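For what it's worth, a self-contained sketch of that call with the parallelism capped explicitly (the dataset and pipeline here are made up for illustration; n_jobs is a real validation_curve parameter and is one way to keep it from grabbing every core):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline

# Toy data and pipeline, invented for illustration only.
X, y = make_regression(n_samples=200, n_features=20, random_state=0)
pipeline = Pipeline([("pca", PCA()), ("reg", LinearRegression())])

train_scores, valid_scores = validation_curve(
    pipeline, X, y,
    param_name="pca__n_components",
    param_range=range(1, 6),
    cv=3,
    scoring="neg_mean_absolute_error",
    n_jobs=2,  # cap the number of parallel workers
)
# One row per param_range value, one column per CV fold.
print(train_scores.shape, valid_scores.shape)
```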
Samesh Lakhotia
@sameshl
How should I install the dependencies for local development of scikit-learn?
I'd recommend using conda and doing conda install numpy scipy cython matplotlib pytest flake8 sphinx sphinx-gallery or something like that
Kristiyan Katsarov
@katsar0v
@amueller by the way, numpy and scipy from conda perform somewhat faster than from pip
but I still haven't found out why
Andreas Mueller
@amueller
@katsar0v that's mkl vs openblas, possibly
but could also be how they are configured by default
i.e. how many threads they use etc
Kristiyan Katsarov
@katsar0v
@amueller how can I reconfigure numpy and scipy to use at most, e.g., 6 threads?
I have no mkl (from conda or pip)
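The usual trick (the standard BLAS thread-count environment variables; 6 here is just an example value matching the question) is to set them before numpy is first imported, since the backends read them at load time:

```python
import os

# BLAS backends read their thread caps at import time, so set these
# before numpy/scipy are first imported (6 is an example value).
os.environ["OPENBLAS_NUM_THREADS"] = "6"  # OpenBLAS builds (e.g. pip wheels)
os.environ["MKL_NUM_THREADS"] = "6"       # MKL builds (e.g. conda defaults)
os.environ["OMP_NUM_THREADS"] = "6"       # OpenMP-based code paths

import numpy as np
from scipy import linalg

a = np.random.randn(500, 500)
U, s, Vh = linalg.svd(a)
print(U.shape, s.shape, Vh.shape)
```

The same variables can also be exported in the shell before launching Python.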
Andreas Mueller
@amueller
pip has no mkl ;)
(so far)
Kristiyan Katsarov
@katsar0v
Andreas Mueller
@amueller
@katsar0v I don't think that helps given that numpy and scipy will not be linked against it
Kristiyan Katsarov
@katsar0v
this saved my life @amueller
Andreas Mueller
@amueller
well in your script n and m are way too small to show anything useful
Kristiyan Katsarov
@katsar0v
it reduced my validation curve
from 500s to 15 seconds
@amueller this is a life saver
Andreas Mueller
@amueller
what did?
Kristiyan Katsarov
@katsar0v
these envs
Andreas Mueller
@amueller
ah
well stackoverflow saved your life
Kristiyan Katsarov
@katsar0v
It's good for performance tweaks
Samesh Lakhotia
@sameshl
How should I build the docs for haversine_distances in my local repo? I ran python setup.py install but I still can't find it under doc/modules/
Guillaume Lemaitre
@glemaitre
The documentation is built with another command:
cd doc
make html
should work on all OSes, I think
then it will create a _build/html folder and you can search for the index.html
Loïc Estève
@lesteve
@sameshl note this part of the contributing scikit-learn doc: https://scikit-learn.org/stable/developers/contributing.html#documentation
If you see ways the contributing doc can be improved while you face these "setup" issues, let us know and/or open PRs to improve it!
Samesh Lakhotia
@sameshl
@lesteve Sure. Thanks for the help.